numpy lacks memory and speed efficiency for Booleans #14821
Note that you can probably get 90% of the performance you want with numba:

```python
import numba
import numpy as np

# 256-entry popcount table: number of set bits in each possible byte value
lookup_table = np.array([bin(x).count("1") for x in range(256)], np.uint8)

@numba.guvectorize([(numba.uint8[:], numba.int64[:])], '(n)->()', nopython=True)
def sum_bits(x, res):
    res[0] = 0
    for xi in x:
        res[0] += lookup_table[xi]
```

```
In [41]: a = np.array([0b1100, 0b11011011], np.uint8)

In [42]: sum_bits(a)
Out[42]: 8

In [43]: %timeit sum_bits(a)
1.83 µs ± 80.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [44]: def totalbits_bytearr(arr):
    ...:     return np.sum(bytebitcounts[arr])
    ...:

In [45]: bytebitcounts = lookup_table

In [46]: %timeit totalbits_bytearr(a)
12.7 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
Thank you very much for that insight, and very impressive results. So numpy is not vectorizing things as nicely as was thought? I had assumed it was... I am also surprised that numba can take Python code with initialization and figure out how to correctly vectorize a for loop whose outer operation is addition and whose inner operation is a table lookup. That is impressive. So apparently this report is mainly asking for a new packed-bits data type, which amounts to a bunch of numba-style code for all the ufuncs, plus extra bit twiddling for indexing/accessing/setting. I suppose bitwise `&` and `|` are already implemented using numba, or could they be improved too given its generic data-type inspection? Update: a guvectorize version is 5 times slower than the native `&`, `|`.
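The benchmark behind that update was not included in the thread; the following is only a sketch of the kind of comparison it presumably refers to (a guvectorize-based bitwise AND versus numpy's built-in operator):

```python
import numba
import numpy as np

@numba.guvectorize([(numba.uint8[:], numba.uint8[:], numba.uint8[:])],
                   '(n),(n)->(n)', nopython=True)
def bitwise_and_gu(x, y, res):
    # element-wise AND written as an explicit loop for numba to compile
    for i in range(x.shape[0]):
        res[i] = x[i] & y[i]

a = np.random.randint(0, 256, 10**6, dtype=np.uint8)
b = np.random.randint(0, 256, 10**6, dtype=np.uint8)

assert np.array_equal(bitwise_and_gu(a, b), a & b)
# %timeit bitwise_and_gu(a, b)   # reportedly ~5x slower than...
# %timeit a & b                  # ...numpy's native C loop
```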
I think you have the wrong model of how numba works, like I used to. I think the things you need to know about numba are that:

- Numpy cannot take a dependency on numba; it would make everything far too cyclic.

They're implemented using native loops in C. The problem with …
I certainly did not know about the table-lookup optimization. So it sounds like it could be wrong to think that we are using the SSE2/AVX instructions in the silicon for that sort of batched parallelism, unless LLVM has gotten to that level. These types of operations could even be thrown over to the GPU when dealing with large enough data sets, or done with multi-threading, etc.
I am a bit sorry about it, but I will close this. I am not against helpers for speeding up or handling bit arrays better, but I do not think they fit the current agenda/roadmap (or design) of numpy, since they would require things like bit-sized strides or similar.
Using pure Boolean values is pretty limiting when dealing with very large data, due to memory waste: np.array uses 1 byte per Boolean, which, although better than 4 or 8, is still an 8-fold waste.

It is easy enough to pack a Boolean np.array into bytes (np.uint8):
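The packing snippet itself was lost from this capture of the issue; a minimal sketch, assuming `np.packbits` is what was meant:

```python
import numpy as np

bools = np.random.rand(10**6) < 0.5   # Boolean array: 1 byte per value
packed = np.packbits(bools)           # uint8 array: 1 bit per value, 8x smaller
```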
Then `&` and `|` provide bitwise operations already. And `np.sum` can be written as a table lookup plus a sum over the lookup results.
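The actual snippet did not survive this capture either; a reconstruction from the `totalbits_bytearr`/`bytebitcounts` definitions quoted in the timing session above, continuing from the packing sketch:

```python
# 256-entry popcount table: set-bit count for every possible byte value
bytebitcounts = np.array([bin(x).count("1") for x in range(256)], np.uint8)

def totalbits_bytearr(arr):
    # one vectorized table lookup, then one vectorized sum
    return np.sum(bytebitcounts[arr])

# np.packbits zero-pads the final byte, so the counts agree:
assert totalbits_bytearr(packed) == bools.sum()
```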
Now I am truly supposing that the table lookup (which uses the imaging table-translation function) is vectorized properly; I would imagine it is, since it is used heavily for image processing. This would be 2 vector operations (`np.sum` and the table lookup) instead of 1 (`np.sum` alone).

PSHUFB (packed byte shuffle) is the name of the processor intrinsic that can do byte table-lookup translation. However, since AVX/SSE2 and similar instructions process a fixed number of bytes at a time, the packed representation needs 8 times fewer vector operations to cover the same number of Booleans; 8 times fewer operations over 2 passes, versus 1 pass over the unpacked data, still comes out 4 times faster.

So if numpy would dare to add a whole data type that uses packed-byte representations of Booleans instead (which might be a major change to implement), it would decrease memory use by 8 times and the number of vector operations by 8 times, except where bit twiddling like that mentioned above is needed, where, depending on the specific operation, it would still tend to be faster.
I can see no reason why this would not be highly desirable for the library, especially since large data sets are pretty typical.

Yes, the primary problem would be endless indexing oddities: reading a bit is `(1 << bitoffset) & value`, and setting one is `if set: value |= (1 << bitoffset)`, as sketched below. But a lot of things are already implicitly supported. Half of the operations are probably trivial (multiplication, and addition as shown), and a few would require some real thinking.
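To make those indexing oddities concrete, here is a hypothetical sketch of single-bit access on a packed `np.uint8` array (assuming `np.packbits`'s default big-endian bit order; `get_bit`/`set_bit` are illustrative names, not an existing API):

```python
import numpy as np

def get_bit(packed, i):
    # np.packbits stores element 0 of the original array in the MSB of byte 0
    return (packed[i // 8] >> (7 - (i % 8))) & 1

def set_bit(packed, i, value):
    mask = np.uint8(1 << (7 - (i % 8)))
    if value:
        packed[i // 8] |= mask            # set the bit
    else:
        packed[i // 8] &= np.uint8(0xFF ^ mask)  # clear the bit

bits = np.packbits(np.array([1, 1, 0, 0], np.bool_))
set_bit(bits, 2, 1)
assert get_bit(bits, 2) == 1
```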
It would make these Python libraries as scalable as C, though, which would be impressive. This would further positively affect a great many libraries out there, giving dramatic potential improvements on large data sets.