SIMD: Optimize the performance of np.packbits in AVX2/AVX512F/VSX. #17102
Conversation
Hang on, this is not a ufunc. How will it dispatch? We have to use the cpu-baseline flags for this function, which I think do not include AVX2.
@mattip, the new dispatcher is already involved. Here is what @Qiyu8 did, step by step:
1. Moved the old SIMD loop into a new dispatch-able source:
/**
 * @targets $maxopt baseline
 * SSE2 AVX2
 * NEON ASIMDDP
 */
#include "compiled_base.h"
/*
 * This function packs boolean values in the input array into the bits of a
 * byte array. Truth values are determined as usual: 0 is false, everything
 * else is true.
 */
NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(compiled_base_pack_inner)
(const char *inptr, npy_intp element_size, npy_intp n_in, npy_intp in_stride, char *outptr, npy_intp n_out, npy_intp out_stride, char order)
{
2. Added forward declarations for the exported function in the header:
numpy/numpy/core/src/multiarray/compiled_base.h, lines 21 to 26 in 81e08d9
#ifndef NPY_DISABLE_OPTIMIZATION
    #include "compiled_base.dispatch.h"
#endif
NPY_CPU_DISPATCH_DECLARE(NPY_NO_EXPORT void compiled_base_pack_inner,
(const char *inptr, npy_intp element_size, npy_intp n_in, npy_intp in_stride, char *outptr, npy_intp n_out, npy_intp out_stride, char order))
3. Implemented the runtime dispatch call in place of the old SIMD code:
numpy/numpy/core/src/multiarray/compiled_base.c, lines 1488 to 1503 in 81e08d9
static NPY_INLINE void
pack_inner(const char *inptr,
           npy_intp element_size,  /* in bytes */
           npy_intp n_in,
           npy_intp in_stride,
           char *outptr,
           npy_intp n_out,
           npy_intp out_stride,
           char order)
{
#ifndef NPY_DISABLE_OPTIMIZATION
    #include "compiled_base.dispatch.h"
#endif
    NPY_CPU_DISPATCH_CALL(return compiled_base_pack_inner,
        (inptr, element_size, n_in, in_stride, outptr, n_out, out_stride, order))
}
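For intuition, the NPY_CPU_DISPATCH_CALL above resolves at runtime to roughly the following (a simplified sketch, not the literal macro expansion; the _AVX2 suffix is what NPY_CPU_DISPATCH_CURFX produces for the AVX2 build of the dispatch-able source):

#ifndef NPY_DISABLE_OPTIMIZATION
    /* the highest requested target that the running CPU supports wins */
    if (NPY_CPU_HAVE(AVX2)) {
        return compiled_base_pack_inner_AVX2(inptr, element_size, n_in,
                                             in_stride, outptr, n_out,
                                             out_stride, order);
    }
    /* ... checks for the other targets listed in the @targets block ... */
#endif
    /* fall back to the baseline build (e.g. SSE2 on x86_64) */
    return compiled_base_pack_inner(inptr, element_size, n_in, in_stride,
                                    outptr, n_out, out_stride, order);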
Yes, the first step explains why AVX2 is enabled without the cpu-baseline flags: the extra targets are declared in the @targets block at the top of the dispatch file.
See above comments about use of .h files.
Please could you replace it with npyv_tobits_b8 from #17789? It should perform better on aarch64.
@seiko2plus The new intrinsics really improved performance; the running time was reduced by a staggering 93%. Here is the latest benchmark result:
AVX2 enabled
Does that increase make sense? Maybe it is somehow skipping the execution. Does the benchmark check that the result is correct?
@mattip AFAIK, the benchmark only checks the performance; correctness should be verified through a mass of test cases. A log was printed in the SIMD loop when running the test cases, so I'm sure the execution is not skipped. Do you suggest importing a new benchmark library such as Google Benchmark instead of asv in order to get a more convincing result?
Thanks for the clarification. Is a 14x speed increase expected? My intuition, which is frequently wrong, is that if something is too good to be true it means there is a bug.
@mattip, MSVC and bureaucracy are two sides of the same coin; they exist to kill performance :).
Please, could you add a benchmark test for little order?
numpy/benchmarks/benchmarks/bench_core.py, line 165 in c9f9081
def time_packbits(self, dtype):
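For context, "little order" here refers to np.packbits(..., bitorder='little'). A scalar sketch of the two packing orders (illustrative only, not the PR's SIMD path; the helper name pack8 is made up):

#include <stdint.h>

/* pack 8 boolean bytes into one output byte: 'b' (big order) puts the first
 * input element into the most significant bit, otherwise (little order) the
 * first input element lands in the least significant bit */
static uint8_t
pack8(const uint8_t *in, char order)
{
    uint8_t byte = 0;
    for (int i = 0; i < 8; ++i) {
        uint8_t bit = (in[i] != 0);
        byte |= (uint8_t)(order == 'b' ? bit << (7 - i) : bit << i);
    }
    return byte;
}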
bb[2] = npyv_tobits_b8(npyv_cmpneq_u8(v2, v_zero));
bb[3] = npyv_tobits_b8(npyv_cmpneq_u8(v3, v_zero));
if(out_stride == 1 &&
    (!NPY_STRONG_ALIGNMENT || npy_is_aligned(outptr, sizeof(npy_uint64)))) {
No need to evaluate npy_is_aligned(outptr, sizeof(npy_uint64)) on every iteration; just one call before the loop is needed.
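A standalone illustration of that point (simplified, not the PR's actual loop; store_words and its parameters are made up): when the output is contiguous and advances by whole 64-bit words, the alignment of outptr does not change across iterations, so the test can be hoisted:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* store nwords 64-bit words into a byte buffer; the alignment check is
 * evaluated once before the loop instead of on every iteration */
static void
store_words(char *outptr, const uint64_t *words, size_t nwords)
{
    const int out_aligned = ((uintptr_t)outptr % sizeof(uint64_t)) == 0;
    for (size_t i = 0; i < nwords; ++i) {
        if (out_aligned) {
            *(uint64_t *)outptr = words[i];               /* aligned store */
        }
        else {
            memcpy(outptr, &words[i], sizeof(uint64_t));  /* unaligned-safe */
        }
        outptr += sizeof(uint64_t);
    }
}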
} else {
    for(int i = 0; i < 4; i++) {
        for (int j = 0; j < vstep; j++) {
            memcpy(outptr, (char*)&bb[i] + j, 1);
I know the compiler is going to optimize it, but why do we need memcpy for storing one byte?
We have npyv_storen_till_u32 now. Do you suggest adding npyv_storen_till_u8 for a one-byte non-contiguous partial store? IMO, small-size memcpy optimization can be treated as a separate optimization in follow-up PRs; this function is called in many places.
Non-contiguous/partial memory access intrinsics for u8, s8, u16, s16 will be useful for other SIMD kernels, but not this one. I just suggested storing each byte via dereferencing the output pointer.
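A minimal sketch of that suggestion (illustrative only; scatter_bytes and its parameter names are made up, not the PR's code):

#include <stdint.h>
#include <stddef.h>

/* write the bytes of each packed bitmask straight through the (possibly
 * strided) output pointer instead of calling memcpy for a single byte */
static void
scatter_bytes(char *outptr, ptrdiff_t out_stride,
              const uint64_t bb[4], size_t vstep)
{
    for (size_t i = 0; i < 4; ++i) {
        const unsigned char *src = (const unsigned char *)&bb[i];
        for (size_t j = 0; j < vstep; ++j) {
            *outptr = (char)src[j];    /* one-byte store via dereference */
            outptr += out_stride;
        }
    }
}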
@mattip I fully understand why you are surprised about the performance improvement, but after multiple logging tests I believe the result is correct. Here is the latest benchmark result (for both little order and big order):
Cool. Looks like a clear win.
Well done, thank you.
Thanks @Qiyu8
#endif
#if !defined(NPY_STRONG_ALIGNMENT)
    #define NPY_STRONG_ALIGNMENT 0
#endif
I just noticed this addition. Is this the same (idea) as NPY_CPU_HAVE_UNALIGNED_ACCESS, which we already use in a handful of places? I am a bit worried about duplicating this logic, especially since this seems to assume that all CPUs have unaligned access, instead of the opposite of only assuming that x86 and amd64 always support unaligned access.
Thanks for pointing this out. In this case only armv7 needs aligned access, which assumes that other CPUs support unaligned access, so NPY_CPU_HAVE_UNALIGNED_ACCESS and NPY_STRONG_ALIGNMENT are like two sides of a coin. You are right about the duplicated logic; I will integrate them in a follow-up PR.
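A hypothetical sketch of what that unification could look like (not part of this PR; NPY_CPU_HAVE_UNALIGNED_ACCESS is the macro mentioned above, and the direction of the mapping is an assumption):

/* derive NPY_STRONG_ALIGNMENT from the existing unaligned-access flag,
 * so only one place encodes which CPUs require aligned access */
#ifndef NPY_STRONG_ALIGNMENT
    #if defined(NPY_CPU_HAVE_UNALIGNED_ACCESS) && NPY_CPU_HAVE_UNALIGNED_ACCESS
        #define NPY_STRONG_ALIGNMENT 0   /* unaligned loads/stores are fine */
    #else
        #define NPY_STRONG_ALIGNMENT 1   /* e.g. armv7: require alignment */
    #endif
#endif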
np.packbits has already been optimized with intrinsics on SSE & NEON; it can easily be extended to AVX2 by using universal intrinsics. Here are the benchmark results:
/arch /O2
On SSE/NEON platforms the performance is not significantly changed. Here is the size change of _multiarray_umath.cp37-win_amd64.pyd: