Conversation

@ptocca (Contributor) commented on Sep 15, 2019:

Reference Issues/PRs

Fixes enhancement request #14304 "manhattan_distances for sparse matrices is slow"

What does this implement/fix? Explain your changes.

This PR primarily affects metrics/pairwise_fast.pyx.
The new version provides a faster Cython implementation of _sparse_manhattan(), but requires that the input matrices have sorted indices.
I also improved the implementation of the dense matrix case, making it less memory-demanding.
In both cases, the implementation is now multithreaded and uses all available cores (via Cython's prange).
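
For illustration, here is a minimal pure-Python sketch of the sorted-index merge that the Cython version performs for each row pair (the helper name manhattan_row_pair is mine, not part of the PR):

import numpy as np

def manhattan_row_pair(X, px, Y, py):
    # Merge the sorted column indices of row px of X and row py of Y,
    # accumulating |x_k - y_k| over the union of their nonzero columns.
    i, i_end = X.indptr[px], X.indptr[px + 1]
    j, j_end = Y.indptr[py], Y.indptr[py + 1]
    d = 0.0
    while i < i_end and j < j_end:
        ix, iy = X.indices[i], Y.indices[j]
        if ix == iy:
            d += abs(X.data[i] - Y.data[j])
            i += 1
            j += 1
        elif ix < iy:
            d += abs(X.data[i])  # column present only in X's row
            i += 1
        else:
            d += abs(Y.data[j])  # column present only in Y's row
            j += 1
    # One row is exhausted; the remainder of the other contributes directly.
    d += np.abs(X.data[i:i_end]).sum() + np.abs(Y.data[j:j_end]).sum()
    return d

The Cython implementation runs this scan for every (px, py) pair and distributes the outer loop over threads with prange.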

I ran the existing tests via:
pytest -v metrics/tests/test_pairwise.py
There are two failures, but they are unrelated to the code I touched.

Any other comments?

This is a first attempt, so I welcome comments and suggestions.
Questions

  1. Is it OK to make it multithreaded by default?
  2. How do I provide non-regression tests? Any tutorials/examples with recommended best practices?

@jnothman (Member) left a comment:

Mostly cosmetics, thanks!

if i < X_indptr[px+1]: ix = X_indices[i]
if j < Y_indptr[py+1]: iy = Y_indices[j]

if ix==iy:
@jnothman (Member):

Spaces around binary operators please

j = Y_indptr[py]
d = 0.0
while i < X_indptr[px+1] and j < Y_indptr[py+1]:
if i < X_indptr[px+1]: ix = X_indices[i]
@jnothman (Member):

Don't put two statements on one line; see PEP 8.

"""
cdef double[::1] row = np.empty(n_features)
cdef np.npy_intp ix, iy, j
"""Pairwise L1 distances for CSR matrices.
@jnothman (Member):

The algorithm may be simpler for CSC...?

Usage:
>>> D = np.zeros(X.shape[0], Y.shape[0])
>>> cython_manhattan(X.data, X.indices, X.indptr,
... Y.data, Y.indices, Y.indptr,
@jnothman (Member):

I think you're assuming that X.has_sorted_indices

s = 0
for k in range(x.shape[1]):
s = s + fabs(x[i,k]-y[j,k])
out[i,j]=s
@jnothman (Member):

The file should end with a newline character

j = j+1

if i== X_indptr[px+1]:
while j < Y_indptr[py+1]:
@jnothman (Member):

You can use for here

D = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
D = np.abs(D, D)
return D.reshape((-1, X.shape[1]))
D = np.empty(shape=(X.shape[0],Y.shape[0]))
@jnothman (Member):

Spaces after commas please

cdef np.npy_intp px, py, i, j, ix, iy
cdef double d = 0.0

cdef int m = D.shape[0]
@jnothman (Member):

You don't need these variables.

@jeremiedbb (Member) commented:

Is it OK to make it multithreaded by default?

We haven't made a decision yet :/

How do I provide non-regression tests?

Tests related to this live in test_pairwise.py. We already test some cases, but you can add a test checking that dense and sparse inputs return the same result, on float32 and float64, with and without sorted indices. You should name it test_manhattan_distances. To write it, you can take inspiration from the other tests.
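
A sketch of what such a test could look like (the parametrization and the _unsort_indices helper are my suggestions, not prescribed code):

import numpy as np
import pytest
from numpy.testing import assert_allclose
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import manhattan_distances

def _unsort_indices(A):
    # Reverse each row's (index, value) pairs: A stays a valid CSR
    # matrix, but its column indices are no longer sorted.
    for k in range(A.shape[0]):
        start, end = A.indptr[k], A.indptr[k + 1]
        A.indices[start:end] = A.indices[start:end][::-1]
        A.data[start:end] = A.data[start:end][::-1]
    A.has_sorted_indices = False
    return A

@pytest.mark.parametrize("dtype", [np.float32, np.float64])
@pytest.mark.parametrize("sort_indices", [True, False])
def test_manhattan_distances(dtype, sort_indices):
    rng = np.random.RandomState(0)
    X = rng.random_sample((10, 5)).astype(dtype)
    Y = rng.random_sample((20, 5)).astype(dtype)
    X_sp, Y_sp = csr_matrix(X), csr_matrix(Y)
    if not sort_indices:
        X_sp, Y_sp = _unsort_indices(X_sp), _unsort_indices(Y_sp)

    # Dense and sparse inputs must give the same distances.
    assert_allclose(manhattan_distances(X_sp, Y_sp),
                    manhattan_distances(X, Y), rtol=1e-5)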

@jeremiedbb (Member) left a comment:

A few comments

#cython: boundscheck=False
#cython: cdivision=True
#cython: wraparound=False
# distutils: extra_compile_args=-fopenmp
@jeremiedbb (Member):

You don't need that; we take care of the OpenMP flags in the setup.

row[X_indices[j]] = X_data[j]
for j in range(Y_indptr[iy], Y_indptr[iy + 1]):
row[Y_indices[j]] -= Y_data[j]
for px in prange(m):
@jeremiedbb (Member):

You can avoid one level of indentation by grouping nogil and prange:

for px in prange(m, nogil=True):
    ...

iy = Y_indices[j]

if ix == iy:
d = d + fabs(X_data[i] - Y_data[j])
@jeremiedbb (Member):

An in-place operation is preferable:
d += fabs(X_data[i] - Y_data[j])

Have you tried using just abs?

Member:

+1 for in-place operations everywhere (i += 1, etc.).

@ptocca (Contributor, author):

When using prange, the in-place operator signals a reduction (at the level of the for statement where the prange is used), which is not what I want here.

Maybe I should add a comment in the code to make that clear.
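
A minimal Cython sketch of the semantics in question (illustrative, not the PR code): an in-place operation on a variable inside a prange body turns it into an OpenMP reduction, combined across iterations, whereas a plain assignment makes it thread-private.

from cython.parallel import prange

def prange_sum(double[::1] vals):
    cdef double total = 0.0
    cdef int i
    # `total += ...` marks `total` as a reduction variable: each thread
    # keeps a partial sum and the partials are combined after the loop.
    for i in prange(vals.shape[0], nogil=True):
        total += vals[i]
    return total

_sparse_manhattan needs the opposite behaviour: d is reset with a plain assignment (d = 0.0) at the top of each outer iteration, which makes it thread-private, and the accumulation is written d = d + ... rather than d += ... to keep it that way.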

Member:

You're right, we want a thread-local variable here.

cdef double s = 0.0
cdef np.npy_intp i, j, k
with nogil:
for i in prange(x.shape[0]):
@jeremiedbb (Member):

Could you benchmark this function, with the environment variable OMP_NUM_THREADS=1 set before any import, against scipy.spatial.distance.cdist(metric='cityblock')?
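
One way to set up that comparison (a sketch; shapes and repeat counts are arbitrary):

import os
os.environ["OMP_NUM_THREADS"] = "1"  # must be set before any OpenMP-using import

from timeit import timeit
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import manhattan_distances

rng = np.random.RandomState(0)
X = rng.random_sample((1000, 100))
Y = rng.random_sample((1500, 100))

print("scipy cdist:", timeit(lambda: cdist(X, Y, metric='cityblock'), number=10))
print("sklearn:    ", timeit(lambda: manhattan_distances(X, Y), number=10))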

@jeremiedbb (Member) commented:

It would be nice if you could also provide a small benchmark vs master (in both single- and multi-threaded modes).

row[X_indices[j]] = X_data[j]
for j in range(Y_indptr[iy], Y_indptr[iy + 1]):
row[Y_indices[j]] -= Y_data[j]
for px in prange(m):
@jeremiedbb (Member):

Maybe add a few comments describing in English what this implementation does, as was done previously.


@ptocca (Contributor, author) commented on Sep 18, 2019:

General comment: in case you had not noticed already, I am not quite up to speed with this development process. I read around and try to do the right thing, but please do not hesitate to set me on the right path.
For instance, is it OK to push new versions of the PR often?
Should I hit Resolve Conversation only when I have pushed a version that I think addresses the issue? Or is it up to whoever raised the issue to accept the resolution?

@ptocca (Contributor, author) commented on Sep 18, 2019:

@jeremiedbb, below is a preliminary benchmark/regression test.
I timed three cases, each with 10 repeats.
What spurred me to work on the dense matrix case as well is that in v0.21.2 it appears to run single-threaded, and I did not like that.
Anyway, on my laptop (4 vCPUs) there is not much of an advantage (multithreading overhead? slow memory subsystem?). But I often work on HPC systems with 24 to 32 cores and fast memory subsystems, and it seems to make a difference there.

Random sparse matrices, 10 repeats, shapes: (1000, 3000) (1500, 3000)
sklearn version: 0.21.2
real	0m14.808s
user	0m14.780s
sys	0m0.248s
Random sparse matrices, 10 repeats, shapes: (1000, 3000) (1500, 3000)
sklearn version: 0.22.dev0
real	0m3.283s
user	0m8.001s
sys	0m0.211s
np.all(np.isclose(.,.)) True
Max abs diff: 4.263256414560601e-14


Random sparse matrices, after train_test_split(), 10 repeats, shapes: (800, 3000) (1200, 3000)
sklearn version: 0.21.2
X.has_sorted_indices: 0
Y.has_sorted_indices: 0
real	0m9.819s
user	0m9.804s
sys	0m0.237s
Random sparse matrices, after train_test_split(), 10 repeats, shapes: (800, 3000) (1200, 3000)
sklearn version: 0.22.dev0
X.has_sorted_indices: 1
Y.has_sorted_indices: 1
real	0m2.321s
user	0m5.325s
sys	0m0.230s
np.all(np.isclose(.,.)) True
Max abs diff: 4.263256414560601e-14


Random dense matrices, 10 repeats, shapes: (1000, 1000) (1500, 1000)
sklearn version: 0.21.2
real	0m10.974s
user	0m10.971s
sys	0m0.213s
Random dense matrices, 10 repeats, shapes: (1000, 1000) (1500, 1000)
sklearn version: 0.22.dev0
real	0m7.572s
user	0m24.814s
sys	0m0.235s
np.all(np.isclose(.,.)) True
Max abs diff: 1.8189894035458565e-12
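
(The benchmark script itself was not posted; a reconstruction along these lines, with guessed density and seed, would produce a comparable run for the sparse case:)

import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import manhattan_distances

rng = np.random.RandomState(0)
X = sp.random(1000, 3000, density=0.01, format='csr', random_state=rng)
Y = sp.random(1500, 3000, density=0.01, format='csr', random_state=rng)

for _ in range(10):
    D = manhattan_distances(X, Y)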

@jeremiedbb (Member) commented:

For instance, is it OK to push often new versions of the PR?

I have no problem with that.

Should I hit the Resolve Conversation only when I have pushed a version that I think addresses the issue? Or is it up to who raised the issue to accept the resolution?

If you made changes that impact the lines a conversation is about, the conversation will be marked as outdated; you don't need to mark it as resolved. And it's better to let the person who started the conversation confirm that it has been addressed correctly :)

@jeremiedbb (Member) commented:

Regarding the benchmarks: the sparse case is very convincing, the dense case less so. I'll take a look tomorrow to see what's going on with the dense case.

@jeremiedbb (Member) commented:

About the dense case:

It's a bit subtle to get the best performance. The reduction is not vectorized by gcc; the way to vectorize it is to manually unroll the loop. Here's how to do it:

from cython cimport floating
from cython.parallel import prange
from libc.math cimport fabs
cimport numpy as np


def _dense_manhattan(floating[:, :] X, floating[:, :] Y, floating[:, :] out):
    cdef:
        int n_samples_x = X.shape[0]
        int n_samples_y = Y.shape[0]
        int n_features = X.shape[1]
        np.npy_intp i, j

    for i in prange(n_samples_x, nogil=True):
        for j in range(n_samples_y):
            out[i, j] = _manhattan_1d(&X[i, 0], &Y[j, 0], n_features)


cdef floating _manhattan_1d(floating *x, floating *y, int n_features) nogil:
    cdef:
        int i
        int n = n_features // 4
        int rem = n_features % 4
        floating result = 0

    # Unrolled by 4 so that gcc can vectorize the accumulation.
    for i in range(n):
        result += (fabs(x[0] - y[0])
                   + fabs(x[1] - y[1])
                   + fabs(x[2] - y[2])
                   + fabs(x[3] - y[3]))
        x += 4
        y += 4

    # Handle the remaining n_features % 4 elements.
    for i in range(rem):
        result += fabs(x[i] - y[i])

    return result

With this you can match (even slightly outperform) scipy, and scalability is good.

However, there's already pairwise_distances, which supports metric='manhattan' and n_jobs, so you can already have parallel Manhattan distances. Isn't that enough for you?
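
For reference, that existing parallel path is a one-liner (minimal usage sketch):

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).random_sample((1000, 100))
Y = np.random.RandomState(1).random_sample((1500, 100))

# n_jobs=-1 distributes the computation over all cores via joblib.
D = pairwise_distances(X, Y, metric='manhattan', n_jobs=-1)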

@ptocca (Contributor, author) commented on Sep 19, 2019:

@jeremiedbb, thanks for looking into this and for pointing out the vectorization opportunities, of which I was not aware.

Regarding the motivation for this PR: I was actually trying to improve the performance of the laplacian kernel computation (in metrics/pairwise.py), which uses the manhattan_distances() function:

def laplacian_kernel(X, Y=None, gamma=None):
    X, Y = check_pairwise_arrays(X, Y)
    if gamma is None:
        gamma = 1.0 / X.shape[1]

    K = -gamma * manhattan_distances(X, Y)
    np.exp(K, K)  # exponentiate K in-place
    return K

@jeremiedbb (Member) commented:

OK, makes sense. In that case, what would be even faster is to re-implement the whole kernel computation, because this

K = -gamma * manhattan_distances(X, Y)
np.exp(K, K)

is sub-optimal: it involves two extra loops over the distance matrix.

Re-using the above code, it would be more efficient to do something like this:

from cython cimport floating
from cython.parallel import prange
from libc.math cimport exp
cimport numpy as np


def _dense_laplacian_kernel(floating[:, :] X, floating[:, :] Y,
                            floating[:, :] out, floating gamma):
    # Note: gamma must be a floating type, since it is typically
    # 1.0 / n_features; an int would truncate it to zero.
    cdef:
        int n_samples_x = X.shape[0]
        int n_samples_y = Y.shape[0]
        int n_features = X.shape[1]
        np.npy_intp i, j
        floating tmp

    for i in prange(n_samples_x, nogil=True):
        for j in range(n_samples_y):
            # Fuse the distance computation and the exponentiation
            # into a single pass over the data.
            tmp = _manhattan_1d(&X[i, 0], &Y[j, 0], n_features)
            out[i, j] = exp(-gamma * tmp)

@jeremiedbb (Member) commented:

Anyway, I think these considerations can be moved to another PR. It will be easier to review if you stick to speeding up sparse Manhattan in this PR, to actually fix the original issue.

@ptocca (Contributor, author) commented on Sep 20, 2019:

OK, I will reinstate the existing implementation of manhattan_distances() for the dense matrix case, and perhaps create another PR with your suggestions. (By the way, the title of the PR is now inaccurate. Can one change it?)

Regarding the sparse case, there is one detail that I am not sure about.
As noted, the proposed _sparse_manhattan() assumes that the input CSR matrices have sorted indices.
To ensure that, I sort them in place prior to calling _sparse_manhattan().
I chose to do it in place because I prefer to avoid copies whenever possible, but it is debatable whether it is OK to modify something that is passed as an argument. What is your view?
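
For reference, scipy's CSR matrices expose both alternatives; a sketch of the trade-off under discussion (the helper name _ensure_sorted_indices is mine):

def _ensure_sorted_indices(X, copy=False):
    # X is a scipy.sparse CSR matrix; the has_sorted_indices flag is
    # cached by scipy, so this check is cheap.
    if not X.has_sorted_indices:
        if copy:
            return X.sorted_indices()  # new matrix; argument untouched
        X.sort_indices()               # sorts in place; no extra memory
    return X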

ptocca added a commit to ptocca/scikit-learn that referenced this pull request Sep 21, 2019
@ptocca (Contributor, author) commented on Sep 21, 2019:

I created PR #15049 to cover just the sparse matrix case.
I suppose this PR can be closed, and any further discussion on the dense case can go into a new PR.

Base automatically changed from master to main January 22, 2021 10:51
@jeremiedbb (Member) commented:

Closing in favor of #15049 (already merged). Thanks @ptocca for your contribution!

@jeremiedbb closed this on Jan 26, 2021.