
numpy lacks memory and speed efficiency for Booleans #14821


Closed
GregoryMorse opened this issue Nov 1, 2019 · 5 comments

GregoryMorse commented Nov 1, 2019

Using pure Boolean values is pretty limiting when dealing with very large data, due to memory waste. np.array uses 1 byte per Boolean, which, although better than 4 or 8 bytes, still wastes 8 times the memory needed.

It is easy enough to pack a Boolean np.array into bytes (np.uint8):

import numpy as np

def boolarr_tobytes(arr):
    # pad to a multiple of 8 bits, then pack 8 Booleans per byte, high bits first
    rem = len(arr) % 8
    if rem != 0:
        arr = np.concatenate((arr, np.zeros(8 - rem, dtype=np.bool_)))
    arr = np.reshape(arr, (len(arr) // 8, 8))
    return np.packbits(arr)

Then & and | already provide bitwise operations on the packed bytes, and np.sum can be written as:

# 256-entry popcount table: number of set bits in each possible byte value
bytebitcounts = np.array([bin(x).count("1") for x in range(256)])

def totalbits_bytearr(arr):
    return np.sum(bytebitcounts[arr])
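
For example, the two helpers compose like this (a quick sketch using the functions defined above; the values are made up for illustration):

bools = np.array([True, False, True, True, False, True, False, False, True])
packed = boolarr_tobytes(bools)                    # 2 bytes instead of 9
assert totalbits_bytearr(packed) == np.sum(bools)  # both count 5 set bits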

Now, I am assuming here that the table lookup is vectorized properly; since it is used heavily for image processing, I would imagine it is. Counting set bits this way takes 2 vector operations (the table lookup and np.sum) instead of the 1 np.sum over the unpacked array. PSHUFB (packed shuffle bytes) is the name of the processor intrinsic that can do byte-table lookup translation. However, since SSE2/AVX and similar instructions operate on fixed-width registers, the packed representation fits 8 times as many Booleans per vector operation (a 256-bit AVX2 register holds 256 packed Booleans versus 32 unpacked ones). So 8 units of work at 1 operation each, versus 1 unit of work at 2 operations, is still 4 times faster.

So if numpy would dare to add a whole new data type using packed-byte representations of Booleans (which would admittedly be a major change to implement), it would decrease memory use by 8 times and increase vector throughput by 8 times, except where bit twiddling like the above is needed; there, depending on the specific operation, it would still tend to be faster.

I can see no reason why this would not be highly desirable for the library, especially since large datasets are pretty typical.

Yes, the primary problem would be endless indexing oddities: (1 << bitoffset) & value to test a bit, and if set: value |= (1 << bitoffset) to write one (see the sketch below). But a lot of things are already implicitly supported.
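
As a minimal sketch of such helpers (get_bit and set_bit are hypothetical names, not numpy API, and assume the MSB-first layout that np.packbits produces):

def get_bit(packed, i):
    # test Boolean i in an MSB-first packed uint8 array
    return (packed[i // 8] >> (7 - i % 8)) & 1

def set_bit(packed, i, value):
    # set or clear Boolean i in place
    mask = np.uint8(1 << (7 - i % 8))
    if value:
        packed[i // 8] |= mask
    else:
        packed[i // 8] &= ~mask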

Half of the operations are probably trivial (multiplication, and addition/counting as shown above); a few would require some real thinking.

It would make these Python libraries as flexibly scalable as C, though, which would be impressive. This would further positively affect a great many libraries out there, giving dramatic potential speed-ups on large data sets.

eric-wieser (Member) commented Nov 1, 2019

Note that you can probably get 90% of the performance you want with numba:

import numba
import numpy as np

lookup_table = np.array([bin(x).count("1") for x in range(256)], np.uint8)

@numba.guvectorize([(numba.uint8[:], numba.int64[:])], '(n)->()', nopython=True)
def sum_bits(x, res):
    # single fused pass: table lookup and accumulation in one loop
    res[0] = 0
    for xi in x:
        res[0] += lookup_table[xi]

In [41]: a = np.array([0b1100, 0b11011011], np.uint8)

In [42]: sum_bits(a)
Out[42]: 8
In [43]: %timeit sum_bits(a)
1.83 µs ± 80.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [44]: def totalbits_bytearr(arr):
    ...:     return np.sum(bytebitcounts[arr])
    ...:

In [45]: bytebitcounts=lookup_table

In [46]: %timeit totalbits_bytearr(a)
12.7 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

GregoryMorse (Author) commented Nov 1, 2019

Thank you very much for that insight, and for the very impressive results. So numpy is not vectorizing things as nicely as I thought? I had assumed it was... I am also surprised that numba can take Python code with initialization and figure out how to correctly vectorize a for loop combining an outer addition with an inner table lookup. That is impressive. So apparently this report is mainly about adding a new packedbits data type, backed by a bunch of numba code for all the ufuncs; the indexing/accessing/setting is extra bit twiddling on top.

I suppose bitwise & and | are already implemented using numba, or could they be improved too, given their generic data-type inspection? Update: a guvectorize version is 5 times slower than the native &, |.
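
A sketch of the kind of comparison behind that update (the gufunc and array names here are illustrative, not from the thread; only the 5x conclusion is):

import numba
import numpy as np

@numba.guvectorize([(numba.uint8[:], numba.uint8[:], numba.uint8[:])],
                   '(n),(n)->(n)', nopython=True)
def and_bits(x, y, res):
    # elementwise AND over packed bytes, 8 Booleans per element
    for i in range(x.shape[0]):
        res[i] = x[i] & y[i]

a = np.random.randint(0, 256, 10**6, dtype=np.uint8)
b = np.random.randint(0, 256, 10**6, dtype=np.uint8)
# %timeit and_bits(a, b)   # reportedly ~5x slower than...
# %timeit a & b            # ...numpy's native C loop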

eric-wieser (Member) commented Nov 1, 2019

Also surprised to see the numba can take Python code with initialization and figure out how to correctly vectorize

I think you have the wrong model of how numba works, like I used to. The things you need to know about numba are that:

  • it translates python into unoptimized LLVM IR
  • it treats any closures (e.g. the lookup table) as compile-time constants
  • LLVM does all the heavy lifting regarding optimization, the same as is used by clang

which has a bunch of numba code for all the ufuncs.

Numpy cannot take a dependency on numba; it would make everything far too cyclic.

I suppose bitwise & and | are already implemented using numba

They're implemented using native loops in C. The problem with np.sum(bytebitcounts[arr]) is that it uses two loops, an intermediate array, and no compile-time knowledge of the lookup table.
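
Roughly, the difference looks like this (a sketch of the evaluation order, not actual numpy internals):

# np.sum(bytebitcounts[arr]): two passes plus a temporary array
tmp = bytebitcounts[arr]   # pass 1: gather into a new array of len(arr)
total = np.sum(tmp)        # pass 2: reduce the temporary

# the numba gufunc: one fused pass, no temporary, and the lookup
# table baked into the compiled code as a constant
total = 0
for byte in arr:
    total += lookup_table[byte]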

GregoryMorse (Author) commented
I certainly did not know about the table-lookup (compile-time constant) optimization. So it sounds like it could be wrong to assume we are using the SSE2/AVX instructions in the silicon for that sort of batched parallelism, unless LLVM's optimizer has gotten to that level. These types of operations could even be thrown over to the GPU when dealing with large enough data sets, or done with multi-threading, etc.

seberg (Member) commented Nov 18, 2019

I am a bit sorry about it, but I will close this. I am not against helpers for speeding up or handling bit arrays better, but I do not think they fit the current agenda/roadmap (or design) of numpy, since they would require things like bit-sized strides or similar.
This does not mean that we cannot do it, but I think it needs some careful design proposals outside of the issue tracker.

seberg closed this as completed Nov 18, 2019