log2 way slower than log or log10 #13836

Open
ZisIsNotZis opened this issue Jun 26, 2019 · 7 comments

Comments

@ZisIsNotZis

Reproducing code example:

# Run in IPython/Jupyter (%timeit is an IPython magic)
import numpy as np

a = np.ones(2**27, 'f')
%timeit np.log(a)
%timeit np.log2(a)
%timeit np.log10(a)

a = np.ones(2**27, 'd')
%timeit np.log(a)
%timeit np.log2(a)
%timeit np.log10(a)

Output:

74.5 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
484 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
92.3 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

181 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
693 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
186 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
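
For readers without IPython, the same comparison can be sketched with the standard-library `timeit` module. This is only an illustration: the helper name `bench`, the smaller array size (2**20 instead of 2**27), and the repeat counts are assumptions chosen to keep the run short, and absolute timings will differ by machine and NumPy version.

```python
import timeit

import numpy as np


def bench(func, arr, repeat=3, number=3):
    """Return the best per-call time in seconds for func(arr)."""
    times = timeit.repeat(lambda: func(arr), repeat=repeat, number=number)
    return min(times) / number


results = {}
for dtype in ("f", "d"):  # float32, then float64, as in the report
    a = np.ones(2**20, dtype)  # assumption: smaller than the 2**27 used above
    for func in (np.log, np.log2, np.log10):
        results[(dtype, func.__name__)] = bench(func, a)

for (dtype, name), seconds in results.items():
    print(f"{dtype} {name:6s} {seconds * 1e3:8.3f} ms")
```

Taking the minimum over repeats reduces noise from other processes; it is not the same statistic as the mean ± std. dev. that `%timeit` prints.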

Numpy/Python version information:

1.16.4 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]

Comparison

With my straightforward implementation of log2:

#include <math.h>
void f(int len, float* x, float* out){
	for(int i=0; i<len; i++)
		out[i] = log2f(x[i]);
}
void d(int len, double* x, double* out){
	for(int i=0; i<len; i++)
		out[i] = log2(x[i]);
}

from ctypes import CDLL, c_uint64
log2f = CDLL('tmp/log2.so').f
log2  = CDLL('tmp/log2.so').d

a   = np.ones(2**27, 'f')
out = np.empty(2**27, 'f')
%timeit log2f(len(a), c_uint64(a.ctypes.data), c_uint64(out.ctypes.data))

a   = np.ones(2**27, 'd')
out = np.empty(2**27, 'd')
%timeit log2(len(a), c_uint64(a.ctypes.data), c_uint64(out.ctypes.data))

The result shows

239 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

421 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

So I feel something must be wrong inside np.log2 that makes it slower than a naive single-threaded implementation. In theory, it should not be slower than np.log10 anyway.

@ZisIsNotZis
Author

ZisIsNotZis commented Jun 26, 2019

Wait, I just did a full C benchmark. The results are:

80.3 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
239 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
80.4 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

183 ms ± 712 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
420 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
184 ms ± 782 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Looks like the performance of log and log10 is consistent between numpy and my implementation, but log2 is way slower than them. Maybe there isn't a good enough SIMD implementation for that?

But anyway, np.log2 is indeed slower than expected.

@ZisIsNotZis
Author

Related SO: Why are log2 and log1p so much faster than log and log10?

@seberg
Member

seberg commented Jun 26, 2019

The difference is probably that at least log has explicit AVX vectorization, and in your loops the compiler probably manages to vectorize as well. I find it plausible that @r-devulap already has improvements in the pipeline, although we may not do it in some cases if the improvements are small.

@r-devulap
Member

float32 log in NumPy 1.17.x uses AVX2/AVX512 intrinsics (see PR #13134). The algorithm easily extends to log2 and log10, though, and it's on my to-do list.

@ZisIsNotZis
Author

ZisIsNotZis commented Jun 26, 2019

So it might take a long time before a fast log2 is implemented in numpy?

Anyway, a quick fix would be:

a = np.ones(2**27, 'f')
%timeit np.log2(a)
%timeit (lambda x:np.divide(x, np.log(2), x))(np.log(a))

480 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
160 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
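
The lambda above relies on the third positional argument of `np.divide` being the output array, so the division overwrites the result of `np.log` instead of allocating a second temporary. A more readable sketch of the same idea follows; the function name `log2_via_log` is hypothetical and this is not how NumPy implements `np.log2`.

```python
import numpy as np


def log2_via_log(x):
    """Approximate log2(x) as log(x) / log(2), reusing one output buffer."""
    out = np.log(x)  # one temporary array holding the natural log
    # Cast the constant to the array's dtype so float32 input stays float32,
    # then divide in place via the `out` argument.
    np.divide(out, out.dtype.type(np.log(2)), out=out)
    return out


a = np.linspace(1.0, 1e9, 1000, dtype=np.float32)
approx = log2_via_log(a)
print(approx.dtype, float(np.max(np.abs(approx - np.log2(a)))))
```

The result can differ from `np.log2` in the last bits because it rounds twice (once in the log, once in the division).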

A quick precision test

a  = np.random.rand(2**27).astype('f') * 1e9
np.linalg.norm(np.log2(a) - np.log(a)/np.log(2)) / len(a)

9.88357312659005e-11
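
The figure above divides the norm of the difference by the array length, which makes it hard to read as a per-element error. A possible alternative, sketched here under assumptions not in the original post (a float64 reference, a smaller 2**20 sample, a fixed seed, and an offset of 1.0 to avoid log(0)), reports the worst-case absolute error of each float32 approach:

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
# Offset by 1.0 so no input is exactly zero (log2(0) is -inf).
a = (rng.random(2**20) * 1e9 + 1.0).astype(np.float32)

reference = np.log2(a.astype(np.float64))  # float64 reference values
direct = np.log2(a)                        # float32 np.log2
via_log = np.log(a) / np.float32(np.log(2))  # float32 workaround

err_direct = float(np.max(np.abs(direct - reference)))
err_via_log = float(np.max(np.abs(via_log - reference)))
print(f"max abs error, np.log2:        {err_direct:.3e}")
print(f"max abs error, log(a)/log(2):  {err_via_log:.3e}")
```

Both errors should be on the order of float32 rounding for results near 30; the exact values depend on the platform's math library.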

@seberg
Member

seberg commented Jun 26, 2019

@ZisIsNotZis that is not a viable "quick fix" within numpy (and probably a bad idea also, since not all arrays can be accelerated with AVX). Having it on Raghuveer's todo list is probably the best way to have it fairly quickly, which likely means it is in 1.18.
If the quick C tests made faster code, I guess the compiler was capable of optimizing things a bit. So it might be that just creating special contiguous paths on a specific (not passed in) function is enough.
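
To see whether memory layout matters on a given build, one could time `np.log2` on a contiguous array and on a strided view of the same length. This is only a sketch with assumed sizes and repeat counts; whether the strided case is slower, and by how much, depends on the NumPy version and CPU.

```python
import timeit

import numpy as np

base = np.ones(2**21, dtype=np.float32)
contiguous = base[: 2**20]  # unit stride
strided = base[::2]         # same length, every other element

print(contiguous.flags.c_contiguous, strided.flags.c_contiguous)

timings = {}
for name, arr in (("contiguous", contiguous), ("strided", strided)):
    best = min(timeit.repeat(lambda: np.log2(arr), repeat=3, number=3)) / 3
    timings[name] = best
    print(f"{name:10s} {best * 1e3:8.3f} ms")
```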

@seberg added the "01 - Enhancement", "Priority: low", and "component: numpy.ufunc" labels and removed the "Priority: low" label on Jun 26, 2019
@r-devulap
Member

see #19478
