ENH, SIMD: Add CPU feature detection and simd functions for AArch64 SVE #22265

kawakami-k · 2022-09-15T15:46:39Z

This PR enhances the CPU feature detection so that it detects the Arm SVE architecture. It also includes vectorized functions for SVE that are implemented similarly to those for AVX/AVX2/ASIMD. The regression test (runtests.py) was executed on a Fujitsu FX700 with A64FX, which is an Armv8.2-A + SVE compliant CPU. The result was "21354 passed, 203 skipped, 1302 deselected, 30 xfailed, 7 xpassed".

Because SVE2 (Armv9) is a superset of the SVE instruction set, I believe this PR also improves NumPy performance on SVE2 environments.
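
For anyone who wants to check whether a Linux machine reports SVE before running the benchmarks, here is a minimal sketch. It assumes the Linux AArch64 convention of listing `sve` (and `sve2`) on the `Features` line of `/proc/cpuinfo`. The function names are hypothetical, and this is not the detection code added by this PR, which works at the C level.

```python
from pathlib import Path


def parse_cpu_features(cpuinfo_text):
    """Return the flags on the first 'Features' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def has_sve(cpuinfo_text):
    """True if the kernel lists the 'sve' flag (assumed Linux AArch64 naming)."""
    return "sve" in parse_cpu_features(cpuinfo_text)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read cpuinfo if it exists; return an empty string on other platforms."""
    p = Path(path)
    return p.read_text() if p.exists() else ""


if __name__ == "__main__":
    print("SVE reported by the kernel:", has_sve(read_cpuinfo()))
```

On a machine without `/proc/cpuinfo`, or one whose kernel does not list the flag, this simply reports `False`.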

EwoutH · 2022-09-16T11:35:11Z

Thanks a lot for this effort!

Do you by any chance have any performance benchmarks? (Maybe see the benchmark docs.)

seiko2plus · 2022-09-17T20:56:09Z

SIMD extensions that are sizeless at compile time should be treated as they were designed. Providing compiled objects for each possible width (256, 512, 1024, 2048) is going to increase the binary size and the maintenance effort. Note that the current SVE implementation only supports a 512-bit width.
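
To illustrate what "sizeless" means here, the sketch below simulates a vector-length-agnostic loop in plain Python. It is only a model under stated assumptions: `vla_add` and its `lanes` parameter are hypothetical names, and the per-lane predicate stands in for what SVE does with a `WHILELT`-style predicate. The point is that the same loop gives the same result for any vector width, so no per-width copy of the kernel is needed.

```python
def vla_add(a, b, lanes):
    """Add two equal-length lists, processing `lanes` elements per step.

    The tail is handled with a per-lane predicate instead of a separate
    scalar remainder loop, so the code does not depend on the width.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if lanes < 1:
        raise ValueError("lanes must be positive")
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        # Predicate: lane k is active only while i + k is inside the array.
        active = [i + k < n for k in range(lanes)]
        for k, on in enumerate(active):
            if on:
                out[i + k] = a[i + k] + b[i + k]
        i += lanes
    return out


if __name__ == "__main__":
    a = list(range(10))
    b = [1] * 10
    # Number of 32-bit lanes for each possible vector width in bits.
    for bits in (256, 512, 1024, 2048):
        assert vla_add(a, b, bits // 32) == [x + 1 for x in a]
    print("same result for every vector width")
```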

IMHO, it's better to wait until we are done with the C++ interface of universal intrinsics, which is designed to support sizeless SIMD extensions (#21057), since we are moving to C++ anyway. However, we can still modify the C interface to make it friendly to sizeless SIMD extensions. Thoughts? @charris, @rgommers, @mattip

kawakami-k · 2022-09-18T02:36:39Z

Hi, @seiko2plus

Thank you for the comment.

When I tried to implement this in a sizeless manner, I couldn't implement the following part (typedef union "simd_data"). If this part can be solved, I think all the other parts (core/src/common/simd/sve/(conversion|memory).h) can be done in a sizeless manner.
https://github.com/kawakami-k/numpy/blob/sve_enablement/numpy/core/src/_simd/_simd_inc.h.src#L14-L56
https://github.com/kawakami-k/numpy/blob/sve_enablement/numpy/core/src/common/simd/sve/sve.h#L5-L12

kawakami-k · 2022-09-18T02:54:42Z

Hi, @EwoutH

In my environment (512-bit SVE), I've confirmed speedups of more than 3x, depending on the benchmark. I've also observed a performance drop of a few percent on some benchmarks. I need in-company confirmation to disclose the absolute processing times of the benchmarks. Could you give me a week?

seiko2plus · 2022-09-18T03:59:12Z

Hi @kawakami-k,

The _simd module (for testing purposes) has been reimplemented in C++ and has become fully friendly with sizeless SIMD. Check the following path (part of #2105): https://github.com/numpy/numpy/tree/efa6ebea6f88c64bcdd5b8d492c13c9cc30536d2/numpy/core/src/_simd

I would suggest postponing your current work for 1-2 months until we are done with #2105. Then we can refactor this work together, and rewriting the C SIMD kernel intrinsics in C++ wouldn't be that hard.
What do you think?

rgommers · 2022-09-20T18:08:16Z

> I need in-company confirmation to disclose the absolute processing times of the benchmarks. Could you give me a week?

In case you don't get permission to post absolute numbers, you could perhaps take the output of python runtests.py --bench-compare and edit it to show only the relative speedups.
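
As a rough sketch of that editing step, the snippet below keeps only the benchmark name and the ratio from `asv compare`-style lines and adds the equivalent speedup (1 / ratio). It assumes the line layout `<sign> <before> <after> <ratio> <name>`; `relative_only` is a hypothetical helper and not part of asv or runtests.py.

```python
import re

# sign, "before" timing, "after" timing, ratio, benchmark name
ROW = re.compile(r"^([+-])\s+(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)\s+(.+)$")


def relative_only(asv_compare_text):
    """Return (benchmark, ratio, speedup) tuples without absolute timings.

    ratio is after/before as printed by asv; speedup is 1/ratio, so a
    value above 1 means the new code is faster.
    """
    rows = []
    for line in asv_compare_text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # header lines, blank lines, unchanged benchmarks
        ratio = float(m.group(4))
        if ratio == 0.0:
            continue  # avoid dividing by zero on a degenerate row
        rows.append((m.group(5).strip(), ratio, 1.0 / ratio))
    return rows


if __name__ == "__main__":
    sample = (
        "-   119±0.2μs   59.6±1μs   0.50  "
        "bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)\n"
    )
    for name, ratio, speedup in relative_only(sample):
        print(f"{speedup:5.2f}x  (ratio {ratio:.2f})  {name}")
```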

kawakami-k · 2022-09-21T00:34:43Z

Hi, @rgommers

> In case you don't get permission to post absolute numbers, you could perhaps take the output of python runtests.py --bench-compare and edit it to show only the relative speedups.

Thank you for the comment. 1) One idea is to disclose the relative performance. 2) Another idea is to run the benchmarks on AWS Graviton3 (256-bit SVE), which I'm preparing to do. For now, I'm going to go with 2).

Since I will not have much time in September, I will run and compare the benchmarks in early October. Thank you.

kawakami-k · 2022-09-21T00:47:07Z

Hi, @seiko2plus

> I would suggest postponing your current work for 1-2 months until we are done with #2105. Then we can refactor this work together, and rewriting the C SIMD kernel intrinsics in C++ wouldn't be that hard. What do you think?

I haven't had time to understand the new C++ interface proposal, but it's no problem to modify this PR to fit the new interface. Thank you.

EwoutH · 2022-09-29T19:17:19Z

> In my environment (512-bit SVE), I've confirmed speedups of more than 3x, depending on the benchmark.

This sounds really awesome! Are you able to share any numbers?

If absolute figures (seconds) aren't possible, you could also share speedups compared to NEON or to plain C (for example, a 2.39x speedup).

kawakami-k · 2022-10-04T00:03:57Z

@EwoutH @rgommers

Below are the benchmark results on AWS Graviton3 (256-bit SVE).
Only results with significant differences are shown.

python runtests.py --cpu-baseline=sve --cpu-dispatch=none --bench-compare sve_enablement256
python runtests.py --bench-compare main
cd benchmarks
asv compare 86cd584b ffe9cf2c | egrep -e "^-" -e "^\+"

The source code I used is 86cd584b and ffe9cf2c.

The implementation has been changed to be as independent of the SIMD size as possible,
but the vector size still has to be specified explicitly to the compiler.
https://github.com/kawakami-k/numpy/blob/sve_enablement256/numpy/distutils/ccompiler_opt.py#L541
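
For readers unfamiliar with how a fixed SVE width is passed to the compiler, here is a small sketch. It assumes GCC or Clang, which accept `-msve-vector-bits=<N>` together with an `-march` value that enables SVE. `sve_compile_flags` is a hypothetical helper, and I am not claiming these are the exact flags set on the linked line of ccompiler_opt.py.

```python
# Widths accepted by -msve-vector-bits besides "scalable" (GCC/Clang).
VALID_SVE_BITS = (128, 256, 512, 1024, 2048)


def sve_compile_flags(vector_bits=None):
    """Build compiler flags for SVE.

    vector_bits=None keeps the code vector-length agnostic; a number
    fixes the width at compile time (256 would match Graviton3).
    """
    flags = ["-march=armv8.2-a+sve"]
    if vector_bits is None:
        return flags
    if vector_bits not in VALID_SVE_BITS:
        raise ValueError(f"unsupported SVE width: {vector_bits}")
    flags.append(f"-msve-vector-bits={vector_bits}")
    return flags


if __name__ == "__main__":
    print(" ".join(sve_compile_flags(256)))
```

A binary built with a fixed width is only expected to run correctly on hardware with that vector length, which is the maintenance cost raised earlier in this thread.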

All benchmarks:

       before           after         ratio
     [86cd584b]       [ffe9cf2c]
     <main>           <sve_enablement>
-        4.81±0μs      4.30±0.01μs     0.89  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
-      51.1±0.1μs       44.6±0.2μs     0.87  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'bool'>)
-      98.9±0.2μs       86.0±0.3μs     0.87  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
-       196±0.2μs        172±0.3μs     0.88  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
-         391±2μs          340±3μs     0.87  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
-      51.0±0.2μs       44.3±0.3μs     0.87  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
-     2.81±0.01μs         2.53±0μs     0.90  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
-        4.81±0μs         4.31±0μs     0.90  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
-     8.68±0.01μs      7.67±0.01μs     0.88  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
-       103±0.1μs       86.1±0.3μs     0.83  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'bool'>)
-       196±0.6μs          172±1μs     0.88  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
-         390±1μs          345±3μs     0.88  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
-         803±9μs         723±10μs     0.90  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
-       103±0.2μs       85.9±0.2μs     0.83  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
-     3.82±0.01μs      3.41±0.01μs     0.89  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
-     6.74±0.01μs      5.99±0.01μs     0.89  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
-     12.5±0.02μs      11.0±0.02μs     0.88  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
-       155±0.4μs        129±0.3μs     0.83  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'bool'>)
-         294±2μs          257±1μs     0.87  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
-         586±1μs          522±5μs     0.89  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
-     1.29±0.02ms      1.13±0.04ms     0.88  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
-       155±0.4μs        129±0.3μs     0.83  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
-     10.7±0.01μs         8.45±0μs     0.79  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'bool'>)
-     12.2±0.02μs      9.66±0.01μs     0.79  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int16'>)
-     12.9±0.02μs      10.4±0.08μs     0.80  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int32'>)
-     14.6±0.03μs      11.5±0.02μs     0.79  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int64'>)
-     11.7±0.01μs      9.48±0.03μs     0.81  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int8'>)
-      31.8±0.9μs      28.3±0.06μs     0.89  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'str'>)
-         651±5μs        413±0.7μs     0.63  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'bool'>)
-         727±2μs        492±0.9μs     0.68  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'numpy.int16'>)
-         815±3μs          570±2μs     0.70  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'numpy.int32'>)
-         964±5μs          699±1μs     0.72  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'numpy.int64'>)
-         684±2μs        453±0.6μs     0.66  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'numpy.int8'>)
-     17.2±0.03μs      12.5±0.02μs     0.73  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'bool'>)
-     19.7±0.03μs      14.5±0.01μs     0.74  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int16'>)
-      21.1±0.1μs      15.7±0.01μs     0.75  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int32'>)
-     24.1±0.04μs      18.3±0.05μs     0.76  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int64'>)
-     18.7±0.03μs      13.9±0.01μs     0.74  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int8'>)
-     1.31±0.01ms          826±2μs     0.63  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'bool'>)
-     1.49±0.02ms         1.00±0ms     0.67  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int16'>)
-     1.67±0.02ms         1.15±0ms     0.69  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int32'>)
-     1.99±0.02ms         1.43±0ms     0.72  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int64'>)
-     1.41±0.02ms          920±2μs     0.65  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int8'>)
-     23.6±0.02μs      16.6±0.03μs     0.70  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'bool'>)
-     26.9±0.02μs      19.2±0.02μs     0.71  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int16'>)
-      29.0±0.2μs      21.1±0.01μs     0.73  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int32'>)
-      33.4±0.3μs       24.9±0.1μs     0.74  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int64'>)
-     25.5±0.04μs      18.4±0.05μs     0.72  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int8'>)
-        1.98±0ms         1.23±0ms     0.62  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'bool'>)
-     2.28±0.02ms         1.50±0ms     0.66  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'numpy.int16'>)
-     2.54±0.02ms         1.72±0ms     0.68  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'numpy.int32'>)
-     3.05±0.02ms      2.16±0.01ms     0.71  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'numpy.int64'>)
-     2.15±0.02ms         1.39±0ms     0.64  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'numpy.int8'>)
-     11.8±0.02ms      10.6±0.06ms     0.90  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'str'>)
-     11.0±0.02μs      8.79±0.02μs     0.80  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'bool'>)
-     12.5±0.06μs      10.1±0.01μs     0.80  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int16'>)
-     13.2±0.02μs      10.6±0.03μs     0.80  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int32'>)
-     14.9±0.03μs      11.9±0.02μs     0.80  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int64'>)
-     12.0±0.02μs      9.84±0.04μs     0.82  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int8'>)
-      32.1±0.8μs       28.9±0.1μs     0.90  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'str'>)
-         655±3μs        412±0.5μs     0.63  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'bool'>)
-         728±4μs        494±0.9μs     0.68  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'numpy.int16'>)
-         813±4μs        572±0.8μs     0.70  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'numpy.int32'>)
-         964±5μs          703±1μs     0.73  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'numpy.int64'>)
-         684±3μs        455±0.9μs     0.67  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'numpy.int8'>)
-     17.3±0.02μs      12.7±0.02μs     0.73  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'bool'>)
-     19.8±0.03μs      14.8±0.07μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int16'>)
-     21.2±0.03μs      16.0±0.04μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int32'>)
-     24.2±0.09μs       18.5±0.1μs     0.77  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int64'>)
-     18.8±0.02μs      14.1±0.03μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int8'>)
-     1.31±0.01ms          821±2μs     0.63  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'bool'>)
-     1.51±0.01ms          998±1μs     0.66  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'numpy.int16'>)
-     1.69±0.01ms         1.15±0ms     0.68  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'numpy.int32'>)
-     1.99±0.01ms      1.42±0.01ms     0.72  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'numpy.int64'>)
-     1.42±0.01ms        917±0.8μs     0.65  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'numpy.int8'>)
-     23.6±0.01μs      16.7±0.04μs     0.71  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'bool'>)
-     26.9±0.02μs      19.4±0.06μs     0.72  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int16'>)
-      28.9±0.2μs      21.4±0.03μs     0.74  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int32'>)
-      33.6±0.2μs       25.3±0.4μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int64'>)
-     25.5±0.03μs      18.4±0.02μs     0.72  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int8'>)
-     1.98±0.01ms         1.23±0ms     0.62  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'bool'>)
-     2.27±0.01ms         1.49±0ms     0.66  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int16'>)
-     2.54±0.01ms         1.72±0ms     0.68  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int32'>)
-     3.04±0.01ms      2.15±0.01ms     0.71  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int64'>)
-     2.15±0.02ms         1.38±0ms     0.64  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int8'>)
-     11.7±0.06ms      10.6±0.07ms     0.90  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'str'>)
+        1.75±0μs      2.95±0.03μs     1.69  bench_core.PackBits.time_packbits(<class 'bool'>)
+     19.4±0.02μs      29.0±0.05μs     1.50  bench_core.PackBits.time_packbits(<class 'numpy.uint64'>)
+         367±2μs        573±0.6μs     1.56  bench_core.PackBits.time_packbits_axis0(<class 'bool'>)
+        433±20μs          605±2μs     1.40  bench_core.PackBits.time_packbits_axis0(<class 'numpy.uint64'>)
+     18.4±0.02μs      45.3±0.09μs     2.47  bench_core.PackBits.time_packbits_axis1(<class 'bool'>)
+       370±0.2μs        561±0.5μs     1.52  bench_core.PackBits.time_packbits_axis1(<class 'numpy.uint64'>)
+     1.95±0.01μs      3.06±0.01μs     1.57  bench_core.PackBits.time_packbits_little(<class 'bool'>)
+     19.7±0.02μs      28.4±0.03μs     1.44  bench_core.PackBits.time_packbits_little(<class 'numpy.uint64'>)
-         817±4μs         704±10μs     0.86  bench_core.Temporaries.time_large
-      43.5±0.1μs       36.3±0.5μs     0.83  bench_core.Temporaries.time_mid
-      43.1±0.3μs      35.8±0.09μs     0.83  bench_core.Temporaries.time_mid2
-     7.12±0.01μs      6.32±0.04μs     0.89  bench_core.UnpackBits.time_unpackbits_little
+       182±0.3μs          205±2μs     1.12  bench_function_base.Sort.time_argsort('merge', 'float32', ('sorted_block', 10))
-      78.9±0.1μs      69.1±0.07μs     0.88  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100))
-     47.2±0.07μs       39.9±0.1μs     0.85  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000))
-      79.5±0.1μs       70.0±0.1μs     0.88  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 100))
-      47.4±0.1μs      40.4±0.04μs     0.85  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 1000))
+     69.1±0.08μs      79.1±0.06μs     1.14  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100))
+     39.9±0.06μs      47.1±0.04μs     1.18  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 1000))
+         400±1μs          481±1μs     1.20  bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))
+       453±0.9μs        514±0.7μs     1.13  bench_function_base.Sort.time_sort('heap', 'float32', ('reversed',))
+         552±1μs        636±0.6μs     1.15  bench_function_base.Sort.time_sort('heap', 'float32', ('sorted_block', 10))
+         613±1μs        869±300μs     1.42  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 1000))
-         624±3μs          553±2μs     0.89  bench_function_base.Sort.time_sort('merge', 'float64', ('random',))
+      67.7±0.1μs         115±50μs     1.70  bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',))
-     7.14±0.01μs      6.40±0.02μs     0.90  bench_function_base.Where.time_all_zeros
+     5.95±0.01μs      6.69±0.03μs     1.12  bench_indexing.ScalarIndexing.time_assign(0)
-         527±3μs          474±4μs     0.90  bench_lib.Nan.time_nanmean(200000, 0)
-         530±3μs          475±5μs     0.90  bench_lib.Nan.time_nanmean(200000, 0.1)
-     1.12±0.03ms      1.01±0.03ms     0.90  bench_lib.Pad.time_pad((1024, 1024), (0, 32), 'mean')
-        1.13±0ms         1.01±0ms     0.90  bench_lib.Pad.time_pad((1024, 1024), 1, 'mean')
-     1.09±0.04ms         970±30μs     0.89  bench_lib.Pad.time_pad((1024, 1024), 8, 'mean')
-       154±0.5μs        137±0.6μs     0.89  bench_lib.Pad.time_pad((4, 4, 4, 4), 8, 'constant')
-      61.2±0.7μs         49.3±1μs     0.81  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-       108±0.5μs         79.2±3μs     0.73  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-     43.2±0.05μs       39.0±0.1μs     0.90  bench_linalg.Linalg.time_op('norm', 'float16')
-     26.0±0.08μs       22.3±0.2μs     0.86  bench_random.Choice.time_choice(1000.0)
-       390±0.4μs        272±0.7μs     0.70  bench_reduce.AddReduceSeparate.time_reduce(0, 'float64')
-         553±9μs          443±1μs     0.80  bench_reduce.AddReduceSeparate.time_reduce(0, 'int16')
-         576±2μs        464±0.8μs     0.81  bench_reduce.AddReduceSeparate.time_reduce(0, 'int32')
-       339±0.5μs        280±0.4μs     0.83  bench_reduce.AddReduceSeparate.time_reduce(0, 'int64')
-        2.02±0ms         1.84±0ms     0.91  bench_reduce.AddReduceSeparate.time_reduce(1, 'float16')
-         625±1μs          403±2μs     0.64  bench_reduce.AddReduceSeparate.time_reduce(1, 'int16')
-         598±6μs          423±1μs     0.71  bench_reduce.AddReduceSeparate.time_reduce(1, 'int32')
-       385±0.2μs        233±0.3μs     0.61  bench_reduce.AddReduceSeparate.time_reduce(1, 'int64')
-      5.52±0.3μs      5.00±0.05μs     0.91  bench_reduce.AnyAll.time_any_slow
+     7.87±0.01μs      14.9±0.08μs     1.89  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-       122±0.2μs         69.1±5μs     0.56  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-     9.86±0.06μs      8.77±0.04μs     0.89  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
-       122±0.2μs         69.1±5μs     0.57  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-     9.87±0.03μs      8.83±0.06μs     0.89  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-      93.5±0.1μs       81.3±0.3μs     0.87  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-      33.8±0.2μs       29.8±0.8μs     0.88  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-       119±0.2μs         59.6±1μs     0.50  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-     9.85±0.02μs      8.81±0.01μs     0.89  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-      17.6±0.1μs      15.7±0.09μs     0.89  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-      33.6±0.2μs       29.5±0.9μs     0.88  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-       119±0.1μs         59.1±1μs     0.50  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-     9.84±0.01μs      8.72±0.01μs     0.89  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-     7.35±0.01μs      6.18±0.01μs     0.84  bench_reduce.MinMax.time_max(<class 'numpy.int64'> (0))
-     7.36±0.04μs      6.18±0.01μs     0.84  bench_reduce.MinMax.time_max(<class 'numpy.int64'> (1))
-        7.36±0μs      6.18±0.01μs     0.84  bench_reduce.MinMax.time_max(<class 'numpy.uint64'>)
-     7.36±0.05μs      6.20±0.02μs     0.84  bench_reduce.MinMax.time_min(<class 'numpy.int64'> (0))
-       186±0.7ms          166±1ms     0.89  bench_shape_base.Kron.time_arr_kron
-       115±0.3ms          104±2ms     0.91  bench_shape_base.Kron.time_mat_kron
+     1.52±0.01μs      1.76±0.09μs     1.16  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '!=')
+      8.57±0.1μs       14.5±0.5μs     1.69  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float32'>)
+     17.3±0.09μs       20.8±0.5μs     1.20  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
+     4.87±0.02μs       6.92±0.2μs     1.42  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int16'>)
+      8.54±0.1μs       14.0±0.5μs     1.64  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int32'>)
+      17.3±0.1μs         21.1±1μs     1.22  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
+     4.87±0.02μs       6.95±0.2μs     1.43  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint16'>)
+      8.54±0.1μs       13.4±0.5μs     1.57  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
+      17.2±0.1μs         21.4±1μs     1.25  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint64'>)
+     6.53±0.01μs       13.0±0.8μs     1.99  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float32'>)
+      11.5±0.2μs       16.1±0.4μs     1.40  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float64'>)
+     4.26±0.01μs      6.82±0.01μs     1.60  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int16'>)
+     6.50±0.04μs       13.4±0.9μs     2.06  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
+      11.4±0.2μs       16.5±0.9μs     1.45  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int64'>)
+     4.25±0.01μs      6.82±0.01μs     1.60  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint16'>)
+     6.47±0.01μs       13.5±0.9μs     2.08  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
+      11.2±0.4μs         16.4±1μs     1.46  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
+     3.08±0.02μs      3.39±0.03μs     1.10  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
+     6.55±0.01μs       13.1±0.7μs     1.99  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float32'>)
+      11.8±0.3μs       16.0±0.7μs     1.36  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float64'>)
+     4.26±0.01μs      6.81±0.01μs     1.60  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
+     6.51±0.02μs       13.5±0.8μs     2.07  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
+      11.5±0.1μs         16.5±1μs     1.43  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
+     4.30±0.01μs      6.82±0.01μs     1.59  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
+     6.51±0.03μs       13.4±0.9μs     2.06  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
+      11.5±0.2μs         16.4±1μs     1.42  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-      43.8±0.2μs       34.2±0.4μs     0.78  bench_ufunc.CustomInplace.time_double_add
-      50.7±0.3μs       42.0±0.2μs     0.83  bench_ufunc.CustomInplace.time_double_add_temp
-      46.8±0.6μs       36.0±0.1μs     0.77  bench_ufunc.CustomInplace.time_float_add
-      53.4±0.3μs       43.3±0.3μs     0.81  bench_ufunc.CustomInplace.time_float_add_temp
-      45.7±0.4μs       40.1±0.4μs     0.88  bench_ufunc.CustomInplace.time_int_or
-      53.3±0.4μs       46.3±0.5μs     0.87  bench_ufunc.CustomInplace.time_int_or_temp
-     4.37±0.02μs      3.63±0.01μs     0.83  bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float32'>)
-     6.78±0.06μs      5.40±0.04μs     0.80  bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float64'>)
-     5.10±0.01μs      4.07±0.03μs     0.80  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
-     5.11±0.02μs      4.07±0.04μs     0.80  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
-     5.10±0.02μs      4.07±0.02μs     0.80  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
-     5.10±0.02μs      4.06±0.02μs     0.80  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
-     8.67±0.01μs         6.34±0μs     0.73  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
-     8.67±0.04μs      6.34±0.01μs     0.73  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
-     8.69±0.02μs      6.34±0.01μs     0.73  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
-     8.68±0.04μs         6.34±0μs     0.73  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
-     3.35±0.01μs      2.76±0.04μs     0.82  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
-     3.35±0.01μs      2.76±0.04μs     0.82  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
-        3.36±2μs      2.76±0.04μs     0.82  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
-     3.35±0.02μs      2.76±0.03μs     0.82  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
-     3.95±0.03μs      3.27±0.02μs     0.83  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
-     3.93±0.02μs      3.28±0.02μs     0.83  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-     5.78±0.01μs      4.37±0.01μs     0.76  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-     5.78±0.01μs      4.37±0.01μs     0.76  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-     3.04±0.01μs      2.69±0.01μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
-     3.05±0.01μs      2.69±0.01μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
-        426±10μs          373±7μs     0.87  bench_ufunc.UFunc.time_ufunc_types('rad2deg')
+     7.30±0.02μs      9.34±0.02μs     1.28  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('add', 1, 'F')
+     7.56±0.05μs       10.2±0.2μs     1.35  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('add', 2, 'F')
+      11.6±0.1μs       20.0±0.1μs     1.72  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('multiply', 1, 'D')
+      14.3±0.1μs       23.2±0.1μs     1.61  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('multiply', 2, 'D')
+      9.66±0.2μs      11.6±0.08μs     1.20  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('multiply', 2, 'F')
+      21.9±0.1μs       29.5±0.2μs     1.35  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('multiply', 4, 'D')
+     7.32±0.03μs      9.45±0.03μs     1.29  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('subtract', 1, 'F')
+     7.57±0.05μs      10.3±0.08μs     1.36  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('subtract', 2, 'F')
-      10.1±0.1μs       8.96±0.3μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'H')
-      42.1±0.1μs       38.2±0.1μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'L')
-      42.1±0.2μs       38.1±0.3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'Q')
-      10.0±0.2μs       9.10±0.3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'h')
-      42.3±0.2μs       38.3±0.4μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'l')
-      10.2±0.3μs       9.01±0.1μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'H')
-      42.0±0.2μs       38.0±0.3μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'L')
-      42.2±0.3μs       38.2±0.3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'l')
-      42.1±0.2μs       38.2±0.2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'q')
-      62.4±0.5μs       33.8±0.1μs     0.54  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'absolute'>, 4, 1, 'f')
-      81.8±0.2μs       51.8±0.2μs     0.63  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'absolute'>, 4, 2, 'f')
-      88.6±0.9μs       66.5±0.2μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'absolute'>, 4, 4, 'f')
-      25.6±0.2μs       22.3±0.5μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 1, 'f')
-      47.8±0.3μs       43.1±0.5μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 2, 'f')
-      66.8±0.4μs       33.7±0.1μs     0.50  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 1, 'f')
-      82.3±0.1μs       51.0±0.2μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 2, 'f')
-        88.9±1μs       66.3±0.2μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 4, 'f')
+      44.5±0.5μs       51.3±0.2μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 1, 'd')
-      47.9±0.5μs       42.9±0.4μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 1, 'f')
-      50.6±0.3μs       45.8±0.2μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 2, 'f')
-      50.8±0.3μs       43.7±0.1μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 2, 1, 'f')
-      51.5±0.6μs       46.3±0.4μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 4, 1, 'e')
-      50.1±0.5μs       44.9±0.2μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 4, 2, 'e')
+      44.3±0.2μs       51.3±0.4μs     1.16  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 1, 'd')
-        47.8±1μs       42.9±0.2μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 1, 'f')
-      50.7±0.2μs       45.7±0.3μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 2, 'f')
-      50.7±0.3μs       43.8±0.2μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 2, 1, 'f')
-      50.0±0.6μs       45.3±0.4μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 4, 2, 'e')
-       232±0.7μs        169±0.5μs     0.73  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 1, 'f')
-       285±0.6μs        203±0.4μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 2, 'f')
-         292±4μs          208±3μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 4, 'f')
-         260±1μs        192±0.2μs     0.74  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 1, 'f')
-         319±2μs          241±2μs     0.76  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 2, 'f')
-         318±1μs        239±0.3μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 4, 'f')
-       306±0.9μs        192±0.3μs     0.63  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 1, 'f')
-       338±0.5μs          240±1μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 2, 'f')
-         338±6μs          242±4μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 4, 'f')
-      26.0±0.3μs       22.2±0.4μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 1, 'f')
-      47.6±0.2μs       42.8±0.6μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 2, 'f')
-      66.8±0.8μs       33.7±0.1μs     0.50  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 1, 'f')
-      82.3±0.1μs       51.1±0.2μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 2, 'f')
-        88.7±1μs      66.3±0.09μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 4, 'f')
-       144±0.6μs        128±0.8μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 1, 2, 'f')
-         147±2μs          134±2μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 1, 4, 'f')
-         143±1μs          129±1μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 2, 'f')
-       144±0.6μs        129±0.3μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 2, 'f')
+      44.2±0.5μs       51.3±0.3μs     1.16  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 1, 'd')
-      48.1±0.7μs       42.9±0.2μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 1, 'f')
-      50.8±0.4μs       45.7±0.1μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 2, 'f')
-      50.5±0.4μs       43.8±0.2μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 2, 1, 'f')
-      52.1±0.2μs       43.2±0.7μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 2, 'f')
-         116±4μs          104±3μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 4, 'd')
-      62.0±0.7μs       33.2±0.1μs     0.54  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 4, 1, 'f')
-      84.1±0.2μs       52.3±0.3μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 4, 2, 'f')
-      90.1±0.9μs       66.6±0.2μs     0.74  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 4, 4, 'f')
-      25.9±0.2μs       22.0±0.6μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 1, 'f')
-      47.8±0.4μs       42.6±0.6μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 2, 'f')
-      66.7±0.4μs       33.7±0.1μs     0.50  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 4, 1, 'f')
-      82.3±0.1μs       51.0±0.2μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 4, 2, 'f')
-      88.8±0.9μs       66.2±0.3μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 4, 4, 'f')
-       119±0.5μs       80.4±0.5μs     0.68  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 1, 'd')
+      79.7±0.2μs        118±0.3μs     1.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 1, 'f')
-         118±2μs       83.8±0.3μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 2, 'd')
+      80.3±0.2μs        119±0.4μs     1.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 2, 'f')
-         125±3μs          108±2μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 4, 'd')
+        82.1±1μs          121±2μs     1.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 4, 'f')
-       120±0.7μs         81.8±1μs     0.68  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 1, 'd')
+      80.2±0.2μs        118±0.3μs     1.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 1, 'f')
-       118±0.2μs       87.9±0.3μs     0.74  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 2, 'd')
+      80.6±0.6μs        119±0.8μs     1.47  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 2, 'f')
+      83.7±0.2μs        118±0.2μs     1.41  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 4, 'f')
-         126±4μs         93.3±7μs     0.74  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 1, 'd')
+      81.4±0.7μs        120±0.3μs     1.47  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 1, 'f')
-         120±1μs        104±0.4μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 2, 'd')
+      80.9±0.2μs        118±0.1μs     1.46  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 2, 'f')
+        88.5±1μs          118±2μs     1.34  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 4, 'f')
-       225±0.6μs        153±0.9μs     0.68  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 1, 'f')
-       277±0.8μs        194±0.4μs     0.70  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 2, 'f')
-         283±4μs          197±3μs     0.70  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 4, 'f')
-       250±0.9μs        207±0.6μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 1, 'f')
-         306±2μs          241±2μs     0.79  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 2, 'f')
-       305±0.5μs        241±0.6μs     0.79  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 4, 'f')
-         291±1μs        207±0.6μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 1, 'f')
-       332±0.5μs          241±3μs     0.73  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 2, 'f')
-         333±5μs          241±4μs     0.72  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 4, 'f')
-      60.5±0.3μs       39.0±0.4μs     0.65  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 1, 2, 'f')
-      60.7±0.8μs         53.6±1μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 1, 4, 'f')
-      54.2±0.3μs       43.4±0.4μs     0.80  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 2, 2, 'f')
-      62.5±0.9μs       36.6±0.1μs     0.58  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 1, 'f')
-      83.0±0.2μs       52.8±0.1μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 2, 'f')
-        88.8±1μs       66.4±0.2μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 4, 'f')
-      25.1±0.2μs       21.9±0.3μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 2, 1, 'f')
-      47.7±0.3μs       43.1±0.5μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 2, 2, 'f')
-      64.9±0.3μs       33.7±0.1μs     0.52  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 4, 1, 'f')
-      82.5±0.1μs       52.0±0.1μs     0.63  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 4, 2, 'f')
-        88.9±1μs       66.6±0.4μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 4, 4, 'f')
-         815±6μs          594±4μs     0.73  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'd')
-        854±10μs          615±6μs     0.72  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'd')
-       321±0.7μs          256±1μs     0.80  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'f')
-        896±20μs         644±20μs     0.72  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'd')
-         329±4μs          261±4μs     0.79  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'f')
-         849±5μs          646±2μs     0.76  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd')
-         883±8μs          676±1μs     0.77  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd')
-         338±2μs          290±2μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'f')
-        895±30μs         678±20μs     0.76  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd')
-       337±0.5μs        289±0.9μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'f')
-         849±5μs          652±3μs     0.77  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd')
-       302±0.6μs        248±0.5μs     0.82  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f')
-         887±7μs          683±3μs     0.77  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd')
-       352±0.7μs          291±2μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'f')
-         883±8μs          683±3μs     0.77  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd')
-         353±5μs          292±4μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'f')
-      26.0±0.2μs       22.3±0.3μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 2, 1, 'f')
-      47.6±0.2μs       42.8±0.3μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 2, 2, 'f')
-      67.4±0.7μs       33.5±0.2μs     0.50  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 1, 'f')
-      82.3±0.2μs       51.0±0.1μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 2, 'f')
-        88.7±1μs       66.3±0.2μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 4, 'f')

Mousius · 2023-05-18T14:00:57Z

I would suggest postponing your current work for 1-2 months until we are done with #2105; we can then refactor this work together, and rewriting the C SIMD kernels in C++ wouldn't be that hard. What do you think?

@seiko2plus, we're now past the timeframe suggested here. Is there any way to unblock this PR? It'd be great to be able to leverage SVE as we migrate the other routines to universal intrinsics.

github-actions bot added the 01 - Enhancement label Sep 15, 2022

kawakami-k mentioned this pull request Sep 15, 2022

ENH: Add CPU feature detection for SVE2 #21638

Open

kawakami-k force-pushed the sve_enablement branch 2 times, most recently from 8c8425a to c9dd736, September 16, 2022 00:05

seiko2plus added the component: SIMD label Sep 17, 2022

EwoutH mentioned this pull request Sep 29, 2022

ENH: Add CPU feature detection for Intel AMX (Advanced Matrix Extensions) #22355

Open

kawakami-k added 6 commits October 3, 2022 20:58

ENH: Add SVE flags and args

26af068

ENH, SIMD: Add SVE intrinsics

693861a

ENH, SIMD: Add SVE simd functions

b910144

TST: Add test for SVE

cb3c08f

ENH, SIMD: fix coding style

a3fee10

ENH: SIMD: rebase to 86cd584

a200644

kawakami-k force-pushed the sve_enablement branch from 6cf0329 to a200644, October 3, 2022 15:05

ENH: SIMD: add support for SVE 256

40246bc

yamadafuyuka mentioned this pull request Jan 23, 2023

ENH: Add support SLEEF for transcendental functions #23068

Open

ENH, SIMD: Add SVE intrinsics

99ca3cd

kawakami-k force-pushed the sve_enablement branch from 6694061 to 99ca3cd, April 12, 2023 00:53

Mousius mentioned this pull request Jun 29, 2023

ENH: Use Highway's VQSort on AArch64 #24018

Merged

Mousius mentioned this pull request Dec 27, 2023

DOC: add NEP 54 on SIMD - moving to C++ and adopting Highway (or not) #24138

Merged
