`log2` way slower than `log` or `log10` #13836

ZisIsNotZis · 2019-06-26T02:56:17Z

Reproducing code example:

import numpy as np

a = np.ones(2**27, 'f')
%timeit np.log(a)
%timeit np.log2(a)
%timeit np.log10(a)

a = np.ones(2**27, 'd')
%timeit np.log(a)
%timeit np.log2(a)
%timeit np.log10(a)

Output:

74.5 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
484 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
92.3 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

181 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
693 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
186 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numpy/Python version information:

1.16.4 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]

Comparison

With my straightforward implementation of log2:

#include <math.h>
void f(int len, float* x, float* out){
	for(int i=0; i<len; i++)
		out[i] = log2f(x[i]);
}
void d(int len, double* x, double* out){
	for(int i=0; i<len; i++)
		out[i] = log2(x[i]);
}

from ctypes import CDLL, c_uint64
log2f = CDLL('tmp/log2.so').f
log2  = CDLL('tmp/log2.so').d

a   = np.ones(2**27, 'f')
out = np.empty(2**27, 'f')
%timeit log2f(len(a), c_uint64(a.ctypes.data), c_uint64(out.ctypes.data))

a   = np.ones(2**27, 'd')
out = np.empty(2**27, 'd')
%timeit log2(len(a), c_uint64(a.ctypes.data), c_uint64(out.ctypes.data))

The result shows

239 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

421 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

So I feel something must be wrong inside np.log2 that makes it slower than a naive single-threaded implementation. In theory it should not be slower than np.log10 anyway.
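For readers who want to repeat the measurement outside IPython, here is a minimal sketch using the standard `timeit` module instead of `%timeit`. The helper name `time_log_ufuncs`, the default array size, and the repeat counts are assumptions chosen for illustration. The absolute numbers will differ from those above depending on the machine and NumPy build.

```python
import timeit

import numpy as np


def time_log_ufuncs(n=2**20, dtype="f", repeat=3, number=5):
    """Return the best per-call time in seconds for np.log, np.log2, np.log10.

    Hypothetical helper, not part of NumPy.
    """
    a = np.ones(n, dtype)
    out = np.empty_like(a)
    results = {}
    for ufunc in (np.log, np.log2, np.log10):
        # Writing into a preallocated `out` keeps allocation cost out of the timing.
        timings = timeit.repeat(lambda: ufunc(a, out=out), repeat=repeat, number=number)
        # Each entry of `timings` covers `number` calls; report the best run per call.
        results[ufunc.__name__] = min(timings) / number
    return results


if __name__ == "__main__":
    for dtype in ("f", "d"):
        for name, seconds in time_log_ufuncs(dtype=dtype).items():
            print(f"{dtype} {name:6s} {seconds * 1e3:8.3f} ms")
```

Taking the minimum over repeats is one common convention. `%timeit` reports mean and standard deviation instead, so the two are not directly comparable.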


ZisIsNotZis · 2019-06-26T03:12:53Z

Wait, I just ran a full C benchmark. The results are:

80.3 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
239 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
80.4 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

183 ms ± 712 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
420 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
184 ms ± 782 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It looks like the performance of log and log10 is consistent between numpy and my implementation, but log2 is way slower than both. Maybe there isn't a good enough SIMD implementation for that?

But anyway the performance of np.log2 is indeed slower than expected.

ZisIsNotZis · 2019-06-26T03:21:07Z

Related SO: Why are log2 and log1p so much faster than log and log10?

seberg · 2019-06-26T03:43:44Z

The difference is probably that at least log has explicit AVX vectorization, and in your loops the compiler probably manages to do that as well. I find it plausible that @r-devulap already has improvements in the pipeline, although we may not do it in some cases if the improvements are small.

r-devulap · 2019-06-26T04:09:46Z

float32 log in NumPy 1.17.x uses AVX2/AVX512 intrinsics (see PR #13134). The algorithm easily extends to log2 and log10, though, and it's on my to-do list.

ZisIsNotZis · 2019-06-26T05:11:39Z

So it might take a long time before a fast log2 is implemented in numpy?

Anyway, a quick fix would be:

a = np.ones(2**27, 'f')
%timeit np.log2(a)
%timeit (lambda x:np.divide(x, np.log(2), x))(np.log(a))

480 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
160 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
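The one-liner above can be wrapped as a small reusable function. This is only a sketch of the workaround from this comment, relying on the identity log2(x) = ln(x) / ln(2). The name `log2_via_log` is hypothetical and not a NumPy API.

```python
import numpy as np


def log2_via_log(a):
    """Hypothetical helper: compute log2(a) as log(a) / ln(2)."""
    out = np.log(a)  # new array; float32 input gives float32 output
    # Divide in place, casting ln(2) to out's dtype so float32 stays float32.
    np.divide(out, out.dtype.type(np.log(2.0)), out=out)
    return out


if __name__ == "__main__":
    x = np.array([1.0, 2.0, 8.0, 1024.0], dtype=np.float32)
    print(log2_via_log(x))
```

Dividing in place avoids allocating a second temporary array, which matters for inputs as large as the `2**27` elements timed here.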

A quick precision test:

a  = np.random.rand(2**27).astype('f') * 1e9
np.linalg.norm(np.log2(a) - np.log(a)/np.log(2)) / len(a)

9.88357312659005e-11
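Note that `norm(...) / len(a)` divides the 2-norm by N rather than by sqrt(N), so it is smaller than the RMS difference by a factor of sqrt(N). It also compares two float32 results against each other rather than against a more precise reference. A possible alternative check, sketched under the assumption that float64 `np.log2` is an adequate reference, reports the maximum and RMS error of each method. The helper name, sample size, seed, and input range are illustrative choices.

```python
import numpy as np


def log2_error_stats(n=2**16, seed=0):
    """Hypothetical helper: error of float32 log2 variants vs. a float64 reference."""
    rng = np.random.default_rng(seed)
    # Inputs start at 1.0 so no zero value can produce -inf.
    a = rng.uniform(1.0, 1e9, n).astype(np.float32)
    reference = np.log2(a.astype(np.float64))

    direct = np.log2(a).astype(np.float64)
    via_log = (np.log(a) / np.float32(np.log(2.0))).astype(np.float64)

    stats = {}
    for name, values in (("np.log2", direct), ("log/ln2", via_log)):
        err = np.abs(values - reference)
        stats[name] = {
            "max": float(err.max()),
            "rms": float(np.sqrt(np.mean(err**2))),
        }
    return stats


if __name__ == "__main__":
    for name, s in log2_error_stats().items():
        print(f"{name:8s} max={s['max']:.3e} rms={s['rms']:.3e}")
```

The exact figures depend on the NumPy version and on which SIMD code path the build selects, so none are quoted here.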

seberg · 2019-06-26T17:54:26Z

@ZisIsNotZis that is not a viable "quick fix" within numpy (and probably also a bad idea, since not all arrays can be accelerated with AVX). Having it on Raghuveer's to-do list is probably the best way to get it fairly quickly, which likely means it will land in 1.18.
If the quick C tests produced faster code, I guess the compiler was able to optimize things a bit. So it might be that just creating special contiguous paths on a specific (not passed in) function is enough.
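To make "contiguous" concrete, the sketch below shows how to inspect an array's memory layout from Python. It does not show which code path NumPy picks internally. That depends on the NumPy version and build, so treat this only as an illustration of the layouts being discussed.

```python
import numpy as np

a = np.ones(2**20, dtype=np.float32)
strided = a[::2]                         # view that skips every other element
packed = np.ascontiguousarray(strided)   # copies the view into contiguous memory

for name, arr in (("a", a), ("strided", strided), ("packed", packed)):
    # strides are in bytes: 4 for back-to-back float32 values, 8 for every other one
    print(name, arr.flags["C_CONTIGUOUS"], arr.strides)
```

A view such as `a[::2]` shares memory with `a` but steps over elements, which is the kind of array a contiguous-only fast path would not cover.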

r-devulap · 2021-07-16T05:05:32Z

see #19478

seberg added the labels 01 - Enhancement, Priority: low (Valid, but not for immediate attention), and component: numpy.ufunc, and removed the label Priority: low (Valid, but not for immediate attention) · Jun 26, 2019
