SIMD: Optimize the performance of np.packbits in AVX2/AVX512F/VSX. #17102

Merged
merged 29 commits into numpy:master from Qiyu8:usimd-compiled on Dec 19, 2020

Conversation

@Qiyu8 (Member) commented Aug 19, 2020

np.packbits has already been optimized with intrinsics for SSE & NEON; it can easily be extended to AVX2 by using universal intrinsics.
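
For reference, here is a plain scalar sketch (not NumPy's implementation) of what the inner pack loop computes for the default big-endian bit order: every 8 boolean input bytes collapse into one output byte, with non-zero treated as true and the first element landing in the most significant bit. The SIMD versions compute the same mapping 16/32/64 bytes at a time.

/*
 * Minimal scalar sketch of the packbits mapping (illustration only).
 */
#include <stdint.h>
#include <stdio.h>

static void pack8_big(const uint8_t *in, size_t n_in, uint8_t *out)
{
    for (size_t i = 0; i < (n_in + 7) / 8; i++) {
        uint8_t byte = 0;
        for (size_t j = 0; j < 8 && i * 8 + j < n_in; j++) {
            if (in[i * 8 + j] != 0) {
                byte |= (uint8_t)(1u << (7 - j));  /* first element -> MSB */
            }
        }
        out[i] = byte;
    }
}

int main(void)
{
    const uint8_t in[8] = {1, 0, 1, 1, 0, 0, 0, 1};
    uint8_t out[1];
    pack8_big(in, 8, out);
    printf("0x%02X\n", out[0]);  /* prints 0xB1 == 0b10110001 */
    return 0;
}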
Here are the benchmark results:

  • X86 with AVX2 enabled under the MSVC compiler (version 14.26.28801), built with the /arch and /O2 args
       before           after         ratio
     [7b7e7fe4]       [c480ffba]
     <master>         <usimd-compiled>
-      16.8±0.4μs       10.1±0.2μs     0.60  bench_core.PackBits.time_packbits(<class 'bool'>)
-         306±8μs          172±1μs     0.56  bench_core.PackBits.time_packbits_axis1(<class 'bool'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

On SSE/NEON platforms the performance is not significantly changed. Here is the size change of _multiarray_umath.cp37-win_amd64.pyd:

master usimd-compiled
2629KB 2703KB

Qiyu8 added 5 commits August 11, 2020 09:52
Merge branch 'usimd-compiled' of github.com:Qiyu8/numpy into usimd-compiled
@mattip (Member) commented Aug 20, 2020

Hang on, this is not a ufunc. How will it dispatch? We have to use the cpu-baseline flags for this function, which I think do not include AVX2.

@seiko2plus (Member) commented Aug 20, 2020

@mattip, the new dispatcher is already in use here; it works as follows:

@Qiyu8, 1. moved the old SIMD loop into a new dispatchable source, compiled_base.dispatch.c:

/**
* @targets $maxopt baseline
* SSE2 AVX2
* NEON ASIMDDP
*/
#include "compiled_base.h"
/*
* This function packs boolean values in the input array into the bits of a
* byte array. Truth values are determined as usual: 0 is false, everything
* else is true.
*/
NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(compiled_base_pack_inner)
(const char *inptr, npy_intp element_size, npy_intp n_in, npy_intp in_stride, char *outptr, npy_intp n_out, npy_intp out_stride, char order)
{

2. forward declarations for the exported function go here:

#ifndef NPY_DISABLE_OPTIMIZATION
#include "compiled_base.dispatch.h"
#endif
NPY_CPU_DISPATCH_DECLARE(NPY_NO_EXPORT void compiled_base_pack_inner,
(const char *inptr, npy_intp element_size, npy_intp n_in, npy_intp in_stride, char *outptr, npy_intp n_out, npy_intp out_stride, char order))

3. the runtime dispatch call is implemented in place of the old SIMD code:

static NPY_INLINE void
pack_inner(const char *inptr,
           npy_intp element_size,  /* in bytes */
           npy_intp n_in,
           npy_intp in_stride,
           char *outptr,
           npy_intp n_out,
           npy_intp out_stride,
           char order)
{
#ifndef NPY_DISABLE_OPTIMIZATION
    #include "compiled_base.dispatch.h"
#endif
    NPY_CPU_DISPATCH_CALL(return compiled_base_pack_inner,
        (inptr, element_size, n_in, in_stride, outptr, n_out, out_stride, order))
}
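
Conceptually, the generated dispatcher then behaves roughly like the sketch below; the feature query and the suffixed function names are illustrative placeholders, not the actual generated code:

/* Illustrative sketch only: each target listed after @targets gets its own
 * compiled copy of compiled_base_pack_inner, and the runtime call picks the
 * best variant the current CPU supports, falling back to the baseline build.
 * cpu_supports() and the _AVX2 suffix are hypothetical names. */
static void
pack_inner_dispatch_sketch(const char *inptr, npy_intp element_size,
                           npy_intp n_in, npy_intp in_stride, char *outptr,
                           npy_intp n_out, npy_intp out_stride, char order)
{
    if (cpu_supports("AVX2")) {
        compiled_base_pack_inner_AVX2(inptr, element_size, n_in, in_stride,
                                      outptr, n_out, out_stride, order);
    }
    else {
        /* baseline build (e.g. SSE2 on x86) */
        compiled_base_pack_inner(inptr, element_size, n_in, in_stride,
                                 outptr, n_out, out_stride, order);
    }
}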

@Qiyu8 (Member, Author) commented Aug 21, 2020

Yes, the first step explains why AVX2 is enabled without the cpu-baseline flags: we define new dispatch targets at the top of the dispatch file.

* @targets $maxopt baseline 
  * SSE2 AVX2 

Qiyu8 changed the title from "USIMD: Optimize the performance of np.packbits in AVX2." to "USIMD: Optimize the performance of np.packbits in AVX2/AVX512F/VSX." on Aug 21, 2020
@eric-wieser (Member) left a comment

See above comments about use of .h files

charris changed the title from "USIMD: Optimize the performance of np.packbits in AVX2/AVX512F/VSX." to "SIMD: Optimize the performance of np.packbits in AVX2/AVX512F/VSX." on Aug 24, 2020
Qiyu8 added the component: SIMD label on Nov 12, 2020
@seiko2plus (Member) left a comment

Please, could you replace it with npyv_tobits_b8 within #17789? It should perform better on aarch64.
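
For reference, a minimal sketch of the suggested pattern, assuming NumPy's universal intrinsics header and a build where npyv_tobits_b8 (from #17789) is available:

#include "simd/simd.h"  /* NumPy universal intrinsics, inside the numpy source tree */

/* Pack one vector of boolean bytes: bit i of the result is set
 * iff lane i of the input is non-zero. */
static NPY_INLINE npy_uint64
tobits_sketch(const npy_uint8 *in)
{
    npyv_u8 v    = npyv_load_u8(in);
    npyv_b8 mask = npyv_cmpneq_u8(v, npyv_setall_u8(0));
    return npyv_tobits_b8(mask);
}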

seberg added the triaged label and removed the triage review label on Nov 18, 2020
@Qiyu8 (Member, Author) commented Dec 11, 2020

@seiko2plus The new intrinsics really improved performance a lot; the running time is reduced by a staggering 93%. Here is the latest benchmark result:

       before           after         ratio
     [91d9bbeb]       [711482a8]
     <master>         <usimd-compiled>
-        19.4±1μs       3.51±0.6μs     0.18  bench_core.PackBits.time_packbits(<class 'bool'>)
-        347±10μs         25.4±2μs     0.07  bench_core.PackBits.time_packbits_axis1(<class 'bool'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
AVX2 enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Building 711482a8  for virtualenv-py3.7-Cython
·· Installing 711482a8  into virtualenv-py3.7-Cython
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For numpy commit 91d9bbeb  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  8.33%] ··· Running (bench_core.PackBits.time_packbits--)...
[ 25.00%] · For numpy commit 711482a8  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 33.33%] ··· Running (bench_core.PackBits.time_packbits--)...
[ 50.00%] · For numpy commit 711482a8  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 58.33%] ··· bench_core.PackBits.time_packbits                                                                                                                                       ok
[ 58.33%] ··· ============== ============
                  dtype
              -------------- ------------
                   bool       3.51±0.6μs
               numpy.uint64   91.0±20μs
              ============== ============

[ 66.67%] ··· bench_core.PackBits.time_packbits_axis0 ok
[ 66.67%] ··· ============== =============
                  dtype
              -------------- -------------
                   bool         378±4μs
               numpy.uint64   1.42±0.04ms
              ============== =============

[ 75.00%] ··· bench_core.PackBits.time_packbits_axis1 ok
[ 75.00%] ··· ============== =============
                  dtype
              -------------- -------------
                   bool         25.4±2μs
               numpy.uint64   1.23±0.05ms
              ============== =============

[ 75.00%] · For numpy commit 91d9bbeb (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 83.33%] ··· bench_core.PackBits.time_packbits ok
[ 83.33%] ··· ============== ==========
                  dtype
              -------------- ----------
                   bool       19.4±1μs
               numpy.uint64   62.0±1μs
              ============== ==========

[ 91.67%] ··· bench_core.PackBits.time_packbits_axis0 ok
[ 91.67%] ··· ============== =============
                  dtype
              -------------- -------------
                   bool        370±20μs
               numpy.uint64   1.41±0.03ms
              ============== =============

[100.00%] ··· bench_core.PackBits.time_packbits_axis1 ok
[100.00%] ··· ============== =============
                  dtype
              -------------- -------------
                   bool        347±10μs
               numpy.uint64   1.31±0.05ms
              ============== =============


@mattip (Member) commented Dec 11, 2020

Does that increase make sense? Maybe it is somehow skipping the execution. Does the benchmark check that the result is correct?

@Qiyu8 (Member, Author) commented Dec 14, 2020

@mattip AFAIK, the benchmark only checks performance; correctness should be verified through the mass of test cases. A log line was printed in the SIMD loop while running the test cases, so I'm sure the execution is not being skipped. Do you suggest importing a new benchmark library such as Google Benchmark instead of asv in order to get a more convincing result?

@mattip (Member) commented Dec 14, 2020

Thanks for the clarification. Is a 14x speed increase expected? My intuition, which is frequently wrong, is that if something is too good to be true it means there is a bug.

@seiko2plus (Member) commented Dec 14, 2020

@mattip, MSVC and bureaucracy are two sides of the same coin; they exist to kill performance :).
The difference that makes this boost is swapping the order of bytes (big-endian) via npyv_rev64_u8. I expected a 2x to 3x change at most on GCC, but MSVC always has a bad gap in auto-vectorization. I think that issue can be reduced with the flags /arch:AVX and /O2, which can be achieved here by #17736 and --cpu-baseline=avx2.
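
Roughly, the trick looks like the sketch below (not the PR's exact loop, and it assumes NumPy's universal intrinsics): for the default big-endian bit order, the bytes inside each 64-bit lane are reversed before the comparison mask is collapsed to bits, so the first input element lands in the most significant bit of its output byte.

/* Sketch only, assuming NumPy's universal intrinsics header is available. */
static NPY_INLINE npy_uint64
pack_vector_sketch(const npy_uint8 *in, char order)
{
    npyv_u8 v = npyv_load_u8(in);
    if (order == 'b') {
        /* reverse the 8 bytes of every 64-bit lane so that, after
         * npyv_tobits_b8, input element 0 maps to bit 7 of output byte 0 */
        v = npyv_rev64_u8(v);
    }
    return npyv_tobits_b8(npyv_cmpneq_u8(v, npyv_setall_u8(0)));
}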

@seiko2plus (Member) left a comment

Please, could you add a benchmark test for little order?

def time_packbits(self, dtype):

bb[2] = npyv_tobits_b8(npyv_cmpneq_u8(v2, v_zero));
bb[3] = npyv_tobits_b8(npyv_cmpneq_u8(v3, v_zero));
if(out_stride == 1 &&
(!NPY_STRONG_ALIGNMENT || npy_is_aligned(outptr, sizeof(npy_uint64)))) {
@seiko2plus (Member) commented Dec 16, 2020

No need to evaluate npy_is_aligned(outptr, sizeof(npy_uint64)) on every iteration; just one call before the loop is needed.
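
In other words, something like this (a sketch of the suggestion; the variable name is illustrative):

/* Evaluate the alignment test once, before the vector loop, instead of
 * calling npy_is_aligned() on every iteration. */
const int out_is_aligned =
    !NPY_STRONG_ALIGNMENT || npy_is_aligned(outptr, sizeof(npy_uint64));
/* ... then inside the vector loop the branch becomes: */
if (out_stride == 1 && out_is_aligned) {
    /* contiguous, suitably aligned 8-byte store path */
}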

} else {
for(int i = 0; i < 4; i++) {
for (int j = 0; j < vstep; j++) {
memcpy(outptr, (char*)&bb[i] + j, 1);
Member commented:

I know the compiler is going to optimize it, but why do we need memcpy for storing one byte?

@Qiyu8 (Member, Author) commented Dec 16, 2020

We have npyv_storen_till_u32 now; do you suggest adding npyv_storen_till_u8 for one-byte non-contiguous partial stores? IMO, small-size memcpy optimization can be treated as a special optimization in follow-up PRs; this function is called in many places.

Member commented:

Non-contiguous/partial memory access intrinsics for "u8, s8, u16, s16" will be useful for other SIMD kernels, but not this one. I just suggested storing each byte by dereferencing the output pointer.
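
That is, something along these lines (a sketch of the suggestion; the signature is illustrative and reuses the names from the quoted loop):

/* Store each byte of the packed results directly through the strided
 * output pointer instead of calling memcpy for a single byte. */
static void
store_packed_bytes_sketch(const npy_uint64 bb[4], int vstep,
                          char *outptr, npy_intp out_stride)
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < vstep; j++) {
            *outptr = ((const char *)&bb[i])[j];
            outptr += out_stride;
        }
    }
}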

@Qiyu8 (Member, Author) commented Dec 16, 2020

@mattip I fully understand why you are surprised about the performance improvement, but after multiple log tests I believe the result is correct. Here is the latest benchmark result (for both little and big order):

       before           after         ratio
     [9e26d1d2]       [b156231e]
     <master>         <usimd-compiled>
-        86.5±2μs         73.8±3μs     0.85  bench_core.PackBits.time_packbits_little(<class 'numpy.uint64'>)
-        21.2±6μs       3.40±0.3μs     0.16  bench_core.PackBits.time_packbits(<class 'bool'>)
-        23.3±2μs      2.74±0.07μs     0.12  bench_core.PackBits.time_packbits_little(<class 'bool'>)
-         346±9μs       31.0±0.3μs     0.09  bench_core.PackBits.time_packbits_axis1(<class 'bool'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@mattip (Member) commented Dec 16, 2020

Cool. Looks like a clear win.

@seiko2plus (Member) left a comment

Well done, thank you.

@mattip mattip merged commit 3095b43 into numpy:master Dec 19, 2020
@mattip (Member) commented Dec 19, 2020

Thanks @Qiyu8

#endif
#if !defined(NPY_STRONG_ALIGNMENT)
#define NPY_STRONG_ALIGNMENT 0
#endif
Member commented:

I just noticed this addition. Is this the same (idea) as NPY_CPU_HAVE_UNALIGNED_ACCESS, which we already use in a handful of places? I am a bit worried about duplicating this logic, especially since this seems to assume that all CPUs have unaligned access, instead of the opposite of only assuming that x86 and amd64 always support unaligned access.

@Qiyu8 (Member, Author) commented:

Thanks for pointing this out. In this case, only armv7 needs aligned access, which assumes that all other CPUs support unaligned access, so NPY_CPU_HAVE_UNALIGNED_ACCESS and NPY_STRONG_ALIGNMENT are like two sides of a coin. You are right about the duplicated logic; I will integrate them in a follow-up PR.

Labels: 01 - Enhancement, component: numpy._core, component: SIMD, triaged

5 participants