numpy lacks memory and speed efficiency for Booleans #14821
Note that you can probably get 90% of the performance you want with numba:

```python
import numba
import numpy as np

# 256-entry popcount table: number of set bits in each possible byte value
lookup_table = np.array([bin(x).count("1") for x in range(256)], np.uint8)

@numba.guvectorize([(numba.uint8[:], numba.int64[:])], '(n)->()', nopython=True)
def sum_bits(x, res):
    res[0] = 0
    for xi in x:
        res[0] += lookup_table[xi]
```

```
In [41]: a = np.array([0b1100, 0b11011011], np.uint8)

In [42]: sum_bits(a)
Out[42]: 8

In [43]: %timeit sum_bits(a)
1.83 µs ± 80.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [44]: def totalbits_bytearr(arr):
    ...:     return np.sum(bytebitcounts[arr])
    ...:

In [45]: bytebitcounts = lookup_table

In [46]: %timeit totalbits_bytearr(a)
12.7 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
Thank you very much for that insight, and very impressive results. So numpy is not vectorizing things as nicely as was thought? I had assumed it was... I am also surprised that numba can take Python code with initialization and figure out how to correctly vectorize a for loop whose outer operation is addition and whose inner operation is a table lookup. That is impressive. So apparently this report is mainly asking for a new packed-bits data type, which amounts to a bunch of numba-style code for all the ufuncs, plus extra bit twiddling for indexing/accessing/setting. I suppose bitwise `&` and `|` are already implemented using numba, or could they be improved too given its generic data-type inspection? Update: a guvectorize version is 5 times slower than the native `&`, `|`.
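The benchmark behind that update was not included in the thread; the following is only a sketch of the kind of comparison it presumably refers to (a guvectorize-based bitwise AND versus numpy's built-in operator):

```python
import numba
import numpy as np

@numba.guvectorize([(numba.uint8[:], numba.uint8[:], numba.uint8[:])],
                   '(n),(n)->(n)', nopython=True)
def bitwise_and_gu(x, y, res):
    # element-wise AND written as an explicit loop for numba to compile
    for i in range(x.shape[0]):
        res[i] = x[i] & y[i]

a = np.random.randint(0, 256, 10**6, dtype=np.uint8)
b = np.random.randint(0, 256, 10**6, dtype=np.uint8)

assert np.array_equal(bitwise_and_gu(a, b), a & b)
# %timeit bitwise_and_gu(a, b)   # reportedly ~5x slower than...
# %timeit a & b                  # ...numpy's native C loop
```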
I think you have the wrong model of how numba works, like I used to. I think the things you need to know about numba are that:

- Numpy cannot take a dependency on numba; it would make everything far too cyclic.

They're implemented using native loops in C. The problem with …
I certainly did not know about the table-lookup optimization. So it sounds like it could be wrong to think that we are using the SSE2/AVX instructions in the silicon for that sort of batched parallelism, unless LLVM has gotten to that level. These types of operations could even be thrown over to the GPU when dealing with large enough data sets, or done with multi-threading, etc.
I am a bit sorry about it, but I will close this. I am not against helpers for speeding up or handling bit arrays better, but I do not think they fit the current agenda/roadmap (or design) of numpy, since they would require things like bit-sized strides or similar.
Using pure Boolean values is pretty limiting when dealing with very large data, due to memory waste: np.array uses 1 byte per Boolean, which, although better than 4 or 8, is still an 8-fold waste.

It is easy enough to pack a Boolean np.array into bytes (np.uint8):
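The packing snippet itself was lost from this capture of the issue; a minimal sketch, assuming `np.packbits` is what was meant:

```python
import numpy as np

bools = np.random.rand(10**6) < 0.5   # Boolean array: 1 byte per value
packed = np.packbits(bools)           # uint8 array: 1 bit per value, 8x smaller
```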
Then `&` and `|` provide bitwise operations already. And `np.sum` can be written as a table lookup plus a sum over the lookup results.
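The actual snippet did not survive this capture either; a reconstruction from the `totalbits_bytearr`/`bytebitcounts` definitions quoted in the timing session above, continuing from the packing sketch:

```python
# 256-entry popcount table: set-bit count for every possible byte value
bytebitcounts = np.array([bin(x).count("1") for x in range(256)], np.uint8)

def totalbits_bytearr(arr):
    # one vectorized table lookup, then one vectorized sum
    return np.sum(bytebitcounts[arr])

# np.packbits zero-pads the final byte, so the counts agree:
assert totalbits_bytearr(packed) == bools.sum()
```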
Now I am truly supposing that the table lookup (which uses the imaging table-translation function) is vectorized properly; I would imagine it is, since it is used heavily for image processing. This would be 2 vector operations (`np.sum` and the table lookup) instead of 1 (`np.sum` alone).

PSHUFB (packed byte shuffle) is the name of the processor intrinsic that can do byte table-lookup translation. However, since AVX/SSE2 and similar instructions process a fixed number of bytes at a time, the packed representation needs 8 times fewer vector operations to cover the same number of Booleans; 8 times fewer operations over 2 passes, versus 1 pass over the unpacked data, still comes out 4 times faster.

So if numpy would dare to add a whole data type that uses packed-byte representations of Booleans instead (which might be a major change to implement), it would decrease memory use by 8 times and the number of vector operations by 8 times, except where bit twiddling like that mentioned above is needed, where, depending on the specific operation, it would still tend to be faster.
I can see no reason why this would not be highly desirable for the library, especially since large data sets are pretty typical.

Yes, the primary problem would be endless indexing oddities: reading a bit is `(1 << bitoffset) & value`, and setting one is `if set: value |= (1 << bitoffset)`, as sketched below. But a lot of things are already implicitly supported. Half of the operations are probably trivial (multiplication, and addition as shown), and a few would require some real thinking.
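To make those indexing oddities concrete, here is a hypothetical sketch of single-bit access on a packed `np.uint8` array (assuming `np.packbits`'s default big-endian bit order; `get_bit`/`set_bit` are illustrative names, not an existing API):

```python
import numpy as np

def get_bit(packed, i):
    # np.packbits stores element 0 of the original array in the MSB of byte 0
    return (packed[i // 8] >> (7 - (i % 8))) & 1

def set_bit(packed, i, value):
    mask = np.uint8(1 << (7 - (i % 8)))
    if value:
        packed[i // 8] |= mask            # set the bit
    else:
        packed[i // 8] &= np.uint8(0xFF ^ mask)  # clear the bit

bits = np.packbits(np.array([1, 1, 0, 0], np.bool_))
set_bit(bits, 2, 1)
assert get_bit(bits, 2) == 1
```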
It would make these Python libraries as scalable as C, though, which would be impressive. This would further positively affect a great many libraries out there, giving dramatic potential improvements on large data sets.