ENH, SIMD: Add CPU feature detection and simd functions for AArch64 SVE #22265

Open: wants to merge 8 commits into base: main

kawakami-k
Contributor

This PR enhances the CPU feature detection function so that it detects the Arm SVE architecture. It also includes vectorized functions for SVE that are implemented similarly to those for AVX/AVX2/ASIMD. The regression test (runtests.py) was executed on a Fujitsu FX700 with A64FX, which is an Armv8.2-A + SVE compliant CPU. The result was "21354 passed, 203 skipped, 1302 deselected, 30 xfailed, 7 xpassed".

Because SVE2 (Armv9) is a superset of the SVE instruction set, I believe this PR also improves NumPy performance in SVE2 environments.
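
As a rough illustration of what "detecting SVE" means on Linux (a sketch only, not the detection code in this PR): on AArch64 the kernel lists CPU capabilities on the `Features` line of /proc/cpuinfo, and `sve` appears there when the extension is usable. The function name `cpu_flags` and the sample text are made up for this example.

```python
def cpu_flags(cpuinfo_text):
    """Collect the flags from every 'Features' line of /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            flags.update(value.split())
    return flags


# Hypothetical excerpt; on a real machine, read the text from /proc/cpuinfo.
sample = "processor\t: 0\nFeatures\t: fp asimd evtstrm sha2 sve\n"
has_sve = "sve" in cpu_flags(sample)
print("SVE available:", has_sve)
```

On an actual AArch64 Linux host the same function can be fed `open("/proc/cpuinfo").read()`.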

@EwoutH
Contributor

EwoutH commented Sep 16, 2022

Thanks a lot for this effort!

Do you by any chance have any performance benchmarks? (Maybe see the benchmark docs.)

@seiko2plus seiko2plus added the "component: SIMD" label Sep 17, 2022
@seiko2plus
Member

Compile-time sizeless SIMD extensions should be treated as they were designed. Providing compiled objects for each possible width (256, 512, 1024, 2048) is going to increase the binary size and the maintenance effort. Note that the current SVE implementation only supports a 512-bit width.

IMHO, it's better to wait until we are done with the C++ interface of universal intrinsics, which is designed to support sizeless SIMD extensions (#21057), since we are moving to C++ anyway. However, we can still modify the C interface to make it friendly to sizeless SIMD extensions. Thoughts? @charris, @rgommers, @mattip
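
To make the width concern concrete, here is a small sketch of how the lane count per vector changes with each of the widths listed above. It only does the arithmetic; the dtype table and the `lanes` helper are illustrative names, not part of NumPy's universal intrinsics.

```python
WIDTHS_BITS = (256, 512, 1024, 2048)  # the widths named in the comment above
ELEMENT_BITS = {"int8": 8, "int16": 16, "float32": 32, "float64": 64}


def lanes(width_bits, dtype):
    """Number of elements of `dtype` that fit in one vector of `width_bits`."""
    return width_bits // ELEMENT_BITS[dtype]


table = {w: {d: lanes(w, d) for d in ELEMENT_BITS} for w in WIDTHS_BITS}
for width, row in table.items():
    print(width, row)
```

A build that fixes the width at compile time bakes one row of this table into the binary, which is why supporting every width means one set of compiled objects per row.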

@kawakami-k
Contributor Author

kawakami-k commented Sep 18, 2022

Hi, @seiko2plus

Thank you for the comment.

When I tried to implement this in a sizeless manner, I couldn't implement the following part (the typedef union "simd_data"). If this part can be solved, I think all the other parts (core/src/common/simd/sve/(conversion|memory).h) can be done in a sizeless manner.
https://github.com/kawakami-k/numpy/blob/sve_enablement/numpy/core/src/_simd/_simd_inc.h.src#L14-L56
https://github.com/kawakami-k/numpy/blob/sve_enablement/numpy/core/src/common/simd/sve/sve.h#L5-L12

@kawakami-k
Contributor Author

Hi, @EwoutH

In my environment (512-bit SVE), I've confirmed a performance gain of more than 3x, depending on the type of benchmark. I've also observed a performance drop of a few percent on some benchmarks. I need in-company confirmation to disclose absolute processing time of the benchmark. Could you give me a week of time?

@seiko2plus
Member

Hi @kawakami-k,

The _simd module (for testing purposes) has been reimplemented in C++ and
has become fully friendly to sizeless SIMD. Check the following path (part of #2105): https://github.com/numpy/numpy/tree/efa6ebea6f88c64bcdd5b8d492c13c9cc30536d2/numpy/core/src/_simd

I would suggest postponing your current work for 1-2 months until we are done with #2105;
then we can refactor this work together. Rewriting the C SIMD kernels in C++ wouldn't be that hard.
What do you think?

@rgommers
Member

> I need in-company confirmation to disclose absolute processing time of the benchmark. Could you give me a week of time?

In case you don't get permission to post absolute numbers, you could perhaps take the output of python runtests.py --bench-compare and edit it to show only the relative speedups.
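
One possible way to do that editing automatically is sketched below. It assumes each changed row has the layout seen in `asv compare` output (a `+` or `-` marker, the before and after timings, the ratio, then the benchmark name); the helper name `relative_only` is made up for this example.

```python
import re

# One changed row: marker, before timing, after timing, ratio, benchmark name.
ROW = re.compile(r"^([+-])\s+(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)\s+(\S.*)$")


def relative_only(lines):
    """Keep the marker, the after/before ratio and the name; drop the timings."""
    out = []
    for line in lines:
        m = ROW.match(line.rstrip("\n"))
        if m is None:
            continue  # header lines, blank lines and unmarked rows are skipped
        marker, _before, _after, ratio, name = m.groups()
        out.append(f"{marker} {float(ratio):.2f}  {name}")
    return out


sample = [
    "       before           after         ratio",
    "-        4.81±0μs      4.30±0.01μs     0.89  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)",
    "+        1.75±0μs      2.95±0.03μs     1.69  bench_core.PackBits.time_packbits(<class 'bool'>)",
]
stripped = relative_only(sample)
print("\n".join(stripped))
```

A ratio below 1 means the second commit was faster; above 1 means it was slower.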

@kawakami-k
Contributor Author

kawakami-k commented Sep 21, 2022

Hi, @rgommers

> In case you don't get permission to post absolute numbers, you could perhaps take the output of python runtests.py --bench-compare and edit it to show only the relative speedups.

Thank you for the comment. 1) One idea is to disclose the relative performance. 2) Another idea is to run the benchmarks on AWS Graviton3 (256-bit SVE), which I'm preparing to do. For now, I'm going to go with 2).

Since I will not have much time in September, I will measure and compare the benchmarks in early October. Thank you.

@kawakami-k
Contributor Author

kawakami-k commented Sep 21, 2022

Hi, @seiko2plus

> I would suggest postponing your current work for 1-2 months until we are done with #2105; then we can refactor this work together. Rewriting the C SIMD kernels in C++ wouldn't be that hard. What do you think?

I haven't had time to understand the new C++ interface proposal, but it's no problem to modify this PR to fit the new interface. Thank you.

@EwoutH
Contributor

EwoutH commented Sep 29, 2022

> In my environment (512-bit SVE), I've confirmed a performance gain of more than 3x, depending on the type of benchmark.

This sounds really awesome! Are you able to share any numbers?

If absolute figures (seconds) aren't possible, you could also share speedups compared to NEON or to plain C (for example, a 2.39x speedup).

@kawakami-k
Contributor Author

kawakami-k commented Oct 4, 2022

@EwoutH @rgommers

Below are the benchmark results on AWS Graviton3 (256-bit SVE).
Only results with significant differences are extracted.

python runtests.py --cpu-baseline=sve --cpu-dispatch=none --bench-compare sve_enablement256
python runtests.py --bench-compare main
cd benchmarks
asv compare 86cd584b ffe9cf2c | egrep -e "^-" -e "^\+"

The commits I used are 86cd584b and ffe9cf2c.

The implementation has been changed to be as independent of the SIMD size as possible,
but an explicit size specification to the compiler is still required.
https://github.com/kawakami-k/numpy/blob/sve_enablement256/numpy/distutils/ccompiler_opt.py#L541

All benchmarks:

       before           after         ratio
     [86cd584b]       [ffe9cf2c]
     <main>           <sve_enablement>
-        4.81±0μs      4.30±0.01μs     0.89  bench_core.CountNonzero.time_count_nonzero(1, 10000, <class 'numpy.int64'>)
-      51.1±0.1μs       44.6±0.2μs     0.87  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'bool'>)
-      98.9±0.2μs       86.0±0.3μs     0.87  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int16'>)
-       196±0.2μs        172±0.3μs     0.88  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int32'>)
-         391±2μs          340±3μs     0.87  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int64'>)
-      51.0±0.2μs       44.3±0.3μs     0.87  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
-     2.81±0.01μs         2.53±0μs     0.90  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int16'>)
-        4.81±0μs         4.31±0μs     0.90  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int32'>)
-     8.68±0.01μs      7.67±0.01μs     0.88  bench_core.CountNonzero.time_count_nonzero(2, 10000, <class 'numpy.int64'>)
-       103±0.1μs       86.1±0.3μs     0.83  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'bool'>)
-       196±0.6μs          172±1μs     0.88  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int16'>)
-         390±1μs          345±3μs     0.88  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int32'>)
-         803±9μs         723±10μs     0.90  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int64'>)
-       103±0.2μs       85.9±0.2μs     0.83  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
-     3.82±0.01μs      3.41±0.01μs     0.89  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int16'>)
-     6.74±0.01μs      5.99±0.01μs     0.89  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int32'>)
-     12.5±0.02μs      11.0±0.02μs     0.88  bench_core.CountNonzero.time_count_nonzero(3, 10000, <class 'numpy.int64'>)
-       155±0.4μs        129±0.3μs     0.83  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'bool'>)
-         294±2μs          257±1μs     0.87  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int16'>)
-         586±1μs          522±5μs     0.89  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int32'>)
-     1.29±0.02ms      1.13±0.04ms     0.88  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int64'>)
-       155±0.4μs        129±0.3μs     0.83  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
-     10.7±0.01μs         8.45±0μs     0.79  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'bool'>)
-     12.2±0.02μs      9.66±0.01μs     0.79  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int16'>)
-     12.9±0.02μs      10.4±0.08μs     0.80  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int32'>)
-     14.6±0.03μs      11.5±0.02μs     0.79  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int64'>)
-     11.7±0.01μs      9.48±0.03μs     0.81  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int8'>)
-      31.8±0.9μs      28.3±0.06μs     0.89  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'str'>)
-         651±5μs        413±0.7μs     0.63  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'bool'>)
-         727±2μs        492±0.9μs     0.68  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'numpy.int16'>)
-         815±3μs          570±2μs     0.70  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'numpy.int32'>)
-         964±5μs          699±1μs     0.72  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'numpy.int64'>)
-         684±2μs        453±0.6μs     0.66  bench_core.CountNonzero.time_count_nonzero_axis(1, 1000000, <class 'numpy.int8'>)
-     17.2±0.03μs      12.5±0.02μs     0.73  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'bool'>)
-     19.7±0.03μs      14.5±0.01μs     0.74  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int16'>)
-      21.1±0.1μs      15.7±0.01μs     0.75  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int32'>)
-     24.1±0.04μs      18.3±0.05μs     0.76  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int64'>)
-     18.7±0.03μs      13.9±0.01μs     0.74  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'numpy.int8'>)
-     1.31±0.01ms          826±2μs     0.63  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'bool'>)
-     1.49±0.02ms         1.00±0ms     0.67  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int16'>)
-     1.67±0.02ms         1.15±0ms     0.69  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int32'>)
-     1.99±0.02ms         1.43±0ms     0.72  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int64'>)
-     1.41±0.02ms          920±2μs     0.65  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int8'>)
-     23.6±0.02μs      16.6±0.03μs     0.70  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'bool'>)
-     26.9±0.02μs      19.2±0.02μs     0.71  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int16'>)
-      29.0±0.2μs      21.1±0.01μs     0.73  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int32'>)
-      33.4±0.3μs       24.9±0.1μs     0.74  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int64'>)
-     25.5±0.04μs      18.4±0.05μs     0.72  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int8'>)
-        1.98±0ms         1.23±0ms     0.62  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'bool'>)
-     2.28±0.02ms         1.50±0ms     0.66  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'numpy.int16'>)
-     2.54±0.02ms         1.72±0ms     0.68  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'numpy.int32'>)
-     3.05±0.02ms      2.16±0.01ms     0.71  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'numpy.int64'>)
-     2.15±0.02ms         1.39±0ms     0.64  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'numpy.int8'>)
-     11.8±0.02ms      10.6±0.06ms     0.90  bench_core.CountNonzero.time_count_nonzero_axis(3, 1000000, <class 'str'>)
-     11.0±0.02μs      8.79±0.02μs     0.80  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'bool'>)
-     12.5±0.06μs      10.1±0.01μs     0.80  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int16'>)
-     13.2±0.02μs      10.6±0.03μs     0.80  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int32'>)
-     14.9±0.03μs      11.9±0.02μs     0.80  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int64'>)
-     12.0±0.02μs      9.84±0.04μs     0.82  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int8'>)
-      32.1±0.8μs       28.9±0.1μs     0.90  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'str'>)
-         655±3μs        412±0.5μs     0.63  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'bool'>)
-         728±4μs        494±0.9μs     0.68  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'numpy.int16'>)
-         813±4μs        572±0.8μs     0.70  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'numpy.int32'>)
-         964±5μs          703±1μs     0.73  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'numpy.int64'>)
-         684±3μs        455±0.9μs     0.67  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 1000000, <class 'numpy.int8'>)
-     17.3±0.02μs      12.7±0.02μs     0.73  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'bool'>)
-     19.8±0.03μs      14.8±0.07μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int16'>)
-     21.2±0.03μs      16.0±0.04μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int32'>)
-     24.2±0.09μs       18.5±0.1μs     0.77  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int64'>)
-     18.8±0.02μs      14.1±0.03μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int8'>)
-     1.31±0.01ms          821±2μs     0.63  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'bool'>)
-     1.51±0.01ms          998±1μs     0.66  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'numpy.int16'>)
-     1.69±0.01ms         1.15±0ms     0.68  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'numpy.int32'>)
-     1.99±0.01ms      1.42±0.01ms     0.72  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'numpy.int64'>)
-     1.42±0.01ms        917±0.8μs     0.65  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'numpy.int8'>)
-     23.6±0.01μs      16.7±0.04μs     0.71  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'bool'>)
-     26.9±0.02μs      19.4±0.06μs     0.72  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int16'>)
-      28.9±0.2μs      21.4±0.03μs     0.74  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int32'>)
-      33.6±0.2μs       25.3±0.4μs     0.75  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int64'>)
-     25.5±0.03μs      18.4±0.02μs     0.72  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 10000, <class 'numpy.int8'>)
-     1.98±0.01ms         1.23±0ms     0.62  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'bool'>)
-     2.27±0.01ms         1.49±0ms     0.66  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int16'>)
-     2.54±0.01ms         1.72±0ms     0.68  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int32'>)
-     3.04±0.01ms      2.15±0.01ms     0.71  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int64'>)
-     2.15±0.02ms         1.38±0ms     0.64  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int8'>)
-     11.7±0.06ms      10.6±0.07ms     0.90  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'str'>)
+        1.75±0μs      2.95±0.03μs     1.69  bench_core.PackBits.time_packbits(<class 'bool'>)
+     19.4±0.02μs      29.0±0.05μs     1.50  bench_core.PackBits.time_packbits(<class 'numpy.uint64'>)
+         367±2μs        573±0.6μs     1.56  bench_core.PackBits.time_packbits_axis0(<class 'bool'>)
+        433±20μs          605±2μs     1.40  bench_core.PackBits.time_packbits_axis0(<class 'numpy.uint64'>)
+     18.4±0.02μs      45.3±0.09μs     2.47  bench_core.PackBits.time_packbits_axis1(<class 'bool'>)
+       370±0.2μs        561±0.5μs     1.52  bench_core.PackBits.time_packbits_axis1(<class 'numpy.uint64'>)
+     1.95±0.01μs      3.06±0.01μs     1.57  bench_core.PackBits.time_packbits_little(<class 'bool'>)
+     19.7±0.02μs      28.4±0.03μs     1.44  bench_core.PackBits.time_packbits_little(<class 'numpy.uint64'>)
-         817±4μs         704±10μs     0.86  bench_core.Temporaries.time_large
-      43.5±0.1μs       36.3±0.5μs     0.83  bench_core.Temporaries.time_mid
-      43.1±0.3μs      35.8±0.09μs     0.83  bench_core.Temporaries.time_mid2
-     7.12±0.01μs      6.32±0.04μs     0.89  bench_core.UnpackBits.time_unpackbits_little
+       182±0.3μs          205±2μs     1.12  bench_function_base.Sort.time_argsort('merge', 'float32', ('sorted_block', 10))
-      78.9±0.1μs      69.1±0.07μs     0.88  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100))
-     47.2±0.07μs       39.9±0.1μs     0.85  bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000))
-      79.5±0.1μs       70.0±0.1μs     0.88  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 100))
-      47.4±0.1μs      40.4±0.04μs     0.85  bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 1000))
+     69.1±0.08μs      79.1±0.06μs     1.14  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100))
+     39.9±0.06μs      47.1±0.04μs     1.18  bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 1000))
+         400±1μs          481±1μs     1.20  bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))
+       453±0.9μs        514±0.7μs     1.13  bench_function_base.Sort.time_sort('heap', 'float32', ('reversed',))
+         552±1μs        636±0.6μs     1.15  bench_function_base.Sort.time_sort('heap', 'float32', ('sorted_block', 10))
+         613±1μs        869±300μs     1.42  bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 1000))
-         624±3μs          553±2μs     0.89  bench_function_base.Sort.time_sort('merge', 'float64', ('random',))
+      67.7±0.1μs         115±50μs     1.70  bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',))
-     7.14±0.01μs      6.40±0.02μs     0.90  bench_function_base.Where.time_all_zeros
+     5.95±0.01μs      6.69±0.03μs     1.12  bench_indexing.ScalarIndexing.time_assign(0)
-         527±3μs          474±4μs     0.90  bench_lib.Nan.time_nanmean(200000, 0)
-         530±3μs          475±5μs     0.90  bench_lib.Nan.time_nanmean(200000, 0.1)
-     1.12±0.03ms      1.01±0.03ms     0.90  bench_lib.Pad.time_pad((1024, 1024), (0, 32), 'mean')
-        1.13±0ms         1.01±0ms     0.90  bench_lib.Pad.time_pad((1024, 1024), 1, 'mean')
-     1.09±0.04ms         970±30μs     0.89  bench_lib.Pad.time_pad((1024, 1024), 8, 'mean')
-       154±0.5μs        137±0.6μs     0.89  bench_lib.Pad.time_pad((4, 4, 4, 4), 8, 'constant')
-      61.2±0.7μs         49.3±1μs     0.81  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
-       108±0.5μs         79.2±3μs     0.73  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
-     43.2±0.05μs       39.0±0.1μs     0.90  bench_linalg.Linalg.time_op('norm', 'float16')
-     26.0±0.08μs       22.3±0.2μs     0.86  bench_random.Choice.time_choice(1000.0)
-       390±0.4μs        272±0.7μs     0.70  bench_reduce.AddReduceSeparate.time_reduce(0, 'float64')
-         553±9μs          443±1μs     0.80  bench_reduce.AddReduceSeparate.time_reduce(0, 'int16')
-         576±2μs        464±0.8μs     0.81  bench_reduce.AddReduceSeparate.time_reduce(0, 'int32')
-       339±0.5μs        280±0.4μs     0.83  bench_reduce.AddReduceSeparate.time_reduce(0, 'int64')
-        2.02±0ms         1.84±0ms     0.91  bench_reduce.AddReduceSeparate.time_reduce(1, 'float16')
-         625±1μs          403±2μs     0.64  bench_reduce.AddReduceSeparate.time_reduce(1, 'int16')
-         598±6μs          423±1μs     0.71  bench_reduce.AddReduceSeparate.time_reduce(1, 'int32')
-       385±0.2μs        233±0.3μs     0.61  bench_reduce.AddReduceSeparate.time_reduce(1, 'int64')
-      5.52±0.3μs      5.00±0.05μs     0.91  bench_reduce.AnyAll.time_any_slow
+     7.87±0.01μs      14.9±0.08μs     1.89  bench_reduce.ArgMax.time_argmax(<class 'bool'>)
-       122±0.2μs         69.1±5μs     0.56  bench_reduce.ArgMax.time_argmax(<class 'numpy.int64'>)
-     9.86±0.06μs      8.77±0.04μs     0.89  bench_reduce.ArgMax.time_argmax(<class 'numpy.int8'>)
-       122±0.2μs         69.1±5μs     0.57  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint64'>)
-     9.87±0.03μs      8.83±0.06μs     0.89  bench_reduce.ArgMax.time_argmax(<class 'numpy.uint8'>)
-      93.5±0.1μs       81.3±0.3μs     0.87  bench_reduce.ArgMin.time_argmin(<class 'numpy.float64'>)
-      33.8±0.2μs       29.8±0.8μs     0.88  bench_reduce.ArgMin.time_argmin(<class 'numpy.int32'>)
-       119±0.2μs         59.6±1μs     0.50  bench_reduce.ArgMin.time_argmin(<class 'numpy.int64'>)
-     9.85±0.02μs      8.81±0.01μs     0.89  bench_reduce.ArgMin.time_argmin(<class 'numpy.int8'>)
-      17.6±0.1μs      15.7±0.09μs     0.89  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint16'>)
-      33.6±0.2μs       29.5±0.9μs     0.88  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint32'>)
-       119±0.1μs         59.1±1μs     0.50  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint64'>)
-     9.84±0.01μs      8.72±0.01μs     0.89  bench_reduce.ArgMin.time_argmin(<class 'numpy.uint8'>)
-     7.35±0.01μs      6.18±0.01μs     0.84  bench_reduce.MinMax.time_max(<class 'numpy.int64'> (0))
-     7.36±0.04μs      6.18±0.01μs     0.84  bench_reduce.MinMax.time_max(<class 'numpy.int64'> (1))
-        7.36±0μs      6.18±0.01μs     0.84  bench_reduce.MinMax.time_max(<class 'numpy.uint64'>)
-     7.36±0.05μs      6.20±0.02μs     0.84  bench_reduce.MinMax.time_min(<class 'numpy.int64'> (0))
-       186±0.7ms          166±1ms     0.89  bench_shape_base.Kron.time_arr_kron
-       115±0.3ms          104±2ms     0.91  bench_shape_base.Kron.time_mat_kron
+     1.52±0.01μs      1.76±0.09μs     1.16  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '!=')
+      8.57±0.1μs       14.5±0.5μs     1.69  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float32'>)
+     17.3±0.09μs       20.8±0.5μs     1.20  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
+     4.87±0.02μs       6.92±0.2μs     1.42  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int16'>)
+      8.54±0.1μs       14.0±0.5μs     1.64  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int32'>)
+      17.3±0.1μs         21.1±1μs     1.22  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
+     4.87±0.02μs       6.95±0.2μs     1.43  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint16'>)
+      8.54±0.1μs       13.4±0.5μs     1.57  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
+      17.2±0.1μs         21.4±1μs     1.25  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint64'>)
+     6.53±0.01μs       13.0±0.8μs     1.99  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float32'>)
+      11.5±0.2μs       16.1±0.4μs     1.40  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float64'>)
+     4.26±0.01μs      6.82±0.01μs     1.60  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int16'>)
+     6.50±0.04μs       13.4±0.9μs     2.06  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
+      11.4±0.2μs       16.5±0.9μs     1.45  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int64'>)
+     4.25±0.01μs      6.82±0.01μs     1.60  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint16'>)
+     6.47±0.01μs       13.5±0.9μs     2.08  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
+      11.2±0.4μs         16.4±1μs     1.46  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
+     3.08±0.02μs      3.39±0.03μs     1.10  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
+     6.55±0.01μs       13.1±0.7μs     1.99  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float32'>)
+      11.8±0.3μs       16.0±0.7μs     1.36  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float64'>)
+     4.26±0.01μs      6.81±0.01μs     1.60  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
+     6.51±0.02μs       13.5±0.8μs     2.07  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
+      11.5±0.1μs         16.5±1μs     1.43  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
+     4.30±0.01μs      6.82±0.01μs     1.59  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
+     6.51±0.03μs       13.4±0.9μs     2.06  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
+      11.5±0.2μs         16.4±1μs     1.42  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-      43.8±0.2μs       34.2±0.4μs     0.78  bench_ufunc.CustomInplace.time_double_add
-      50.7±0.3μs       42.0±0.2μs     0.83  bench_ufunc.CustomInplace.time_double_add_temp
-      46.8±0.6μs       36.0±0.1μs     0.77  bench_ufunc.CustomInplace.time_float_add
-      53.4±0.3μs       43.3±0.3μs     0.81  bench_ufunc.CustomInplace.time_float_add_temp
-      45.7±0.4μs       40.1±0.4μs     0.88  bench_ufunc.CustomInplace.time_int_or
-      53.3±0.4μs       46.3±0.5μs     0.87  bench_ufunc.CustomInplace.time_int_or_temp
-     4.37±0.02μs      3.63±0.01μs     0.83  bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float32'>)
-     6.78±0.06μs      5.40±0.04μs     0.80  bench_ufunc.CustomScalar.time_add_scalar2(<class 'numpy.float64'>)
-     5.10±0.01μs      4.07±0.03μs     0.80  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
-     5.11±0.02μs      4.07±0.04μs     0.80  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
-     5.10±0.02μs      4.07±0.02μs     0.80  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
-     5.10±0.02μs      4.06±0.02μs     0.80  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
-     8.67±0.01μs         6.34±0μs     0.73  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
-     8.67±0.04μs      6.34±0.01μs     0.73  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
-     8.69±0.02μs      6.34±0.01μs     0.73  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
-     8.68±0.04μs         6.34±0μs     0.73  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
-     3.35±0.01μs      2.76±0.04μs     0.82  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
-     3.35±0.01μs      2.76±0.04μs     0.82  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
-        3.36±2μs      2.76±0.04μs     0.82  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
-     3.35±0.02μs      2.76±0.03μs     0.82  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
-     3.95±0.03μs      3.27±0.02μs     0.83  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 43)
-     3.93±0.02μs      3.28±0.02μs     0.83  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 8)
-     5.78±0.01μs      4.37±0.01μs     0.76  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 43)
-     5.78±0.01μs      4.37±0.01μs     0.76  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 8)
-     3.04±0.01μs      2.69±0.01μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 43)
-     3.05±0.01μs      2.69±0.01μs     0.88  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 8)
-        426±10μs          373±7μs     0.87  bench_ufunc.UFunc.time_ufunc_types('rad2deg')
+     7.30±0.02μs      9.34±0.02μs     1.28  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('add', 1, 'F')
+     7.56±0.05μs       10.2±0.2μs     1.35  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('add', 2, 'F')
+      11.6±0.1μs       20.0±0.1μs     1.72  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('multiply', 1, 'D')
+      14.3±0.1μs       23.2±0.1μs     1.61  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('multiply', 2, 'D')
+      9.66±0.2μs      11.6±0.08μs     1.20  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('multiply', 2, 'F')
+      21.9±0.1μs       29.5±0.2μs     1.35  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('multiply', 4, 'D')
+     7.32±0.03μs      9.45±0.03μs     1.29  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('subtract', 1, 'F')
+     7.57±0.05μs      10.3±0.08μs     1.36  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('subtract', 2, 'F')
-      10.1±0.1μs       8.96±0.3μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'H')
-      42.1±0.1μs       38.2±0.1μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'L')
-      42.1±0.2μs       38.1±0.3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'Q')
-      10.0±0.2μs       9.10±0.3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'h')
-      42.3±0.2μs       38.3±0.4μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 1, 'l')
-      10.2±0.3μs       9.01±0.1μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'H')
-      42.0±0.2μs       38.0±0.3μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'L')
-      42.2±0.3μs       38.2±0.3μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'l')
-      42.1±0.2μs       38.2±0.2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 1, 1, 'q')
-      62.4±0.5μs       33.8±0.1μs     0.54  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'absolute'>, 4, 1, 'f')
-      81.8±0.2μs       51.8±0.2μs     0.63  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'absolute'>, 4, 2, 'f')
-      88.6±0.9μs       66.5±0.2μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'absolute'>, 4, 4, 'f')
-      25.6±0.2μs       22.3±0.5μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 1, 'f')
-      47.8±0.3μs       43.1±0.5μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 2, 'f')
-      66.8±0.4μs       33.7±0.1μs     0.50  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 1, 'f')
-      82.3±0.1μs       51.0±0.2μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 2, 'f')
-        88.9±1μs       66.3±0.2μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 4, 'f')
+      44.5±0.5μs       51.3±0.2μs     1.15  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 1, 'd')
-      47.9±0.5μs       42.9±0.4μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 1, 'f')
-      50.6±0.3μs       45.8±0.2μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 1, 2, 'f')
-      50.8±0.3μs       43.7±0.1μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 2, 1, 'f')
-      51.5±0.6μs       46.3±0.4μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 4, 1, 'e')
-      50.1±0.5μs       44.9±0.2μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (0), 4, 2, 'e')
+      44.3±0.2μs       51.3±0.4μs     1.16  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 1, 'd')
-        47.8±1μs       42.9±0.2μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 1, 'f')
-      50.7±0.2μs       45.7±0.3μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 1, 2, 'f')
-      50.7±0.3μs       43.8±0.2μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 2, 1, 'f')
-      50.0±0.6μs       45.3±0.4μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'> (1), 4, 2, 'e')
-       232±0.7μs        169±0.5μs     0.73  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 1, 'f')
-       285±0.6μs        203±0.4μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 2, 'f')
-         292±4μs          208±3μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 1, 4, 'f')
-         260±1μs        192±0.2μs     0.74  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 1, 'f')
-         319±2μs          241±2μs     0.76  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 2, 'f')
-         318±1μs        239±0.3μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 4, 'f')
-       306±0.9μs        192±0.3μs     0.63  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 1, 'f')
-       338±0.5μs          240±1μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 2, 'f')
-         338±6μs          242±4μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 4, 'f')
-      26.0±0.3μs       22.2±0.4μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 1, 'f')
-      47.6±0.2μs       42.8±0.6μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 2, 'f')
-      66.8±0.8μs       33.7±0.1μs     0.50  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 1, 'f')
-      82.3±0.1μs       51.1±0.2μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 2, 'f')
-        88.7±1μs      66.3±0.09μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 4, 'f')
-       144±0.6μs        128±0.8μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 1, 2, 'f')
-         147±2μs          134±2μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 1, 4, 'f')
-         143±1μs          129±1μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 2, 'f')
-       144±0.6μs        129±0.3μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 2, 'f')
+      44.2±0.5μs       51.3±0.3μs     1.16  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 1, 'd')
-      48.1±0.7μs       42.9±0.2μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 1, 'f')
-      50.8±0.4μs       45.7±0.1μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 2, 'f')
-      50.5±0.4μs       43.8±0.2μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 2, 1, 'f')
-      52.1±0.2μs       43.2±0.7μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 2, 'f')
-         116±4μs          104±3μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 2, 4, 'd')
-      62.0±0.7μs       33.2±0.1μs     0.54  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 4, 1, 'f')
-      84.1±0.2μs       52.3±0.3μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 4, 2, 'f')
-      90.1±0.9μs       66.6±0.2μs     0.74  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'reciprocal'>, 4, 4, 'f')
-      25.9±0.2μs       22.0±0.6μs     0.85  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 1, 'f')
-      47.8±0.4μs       42.6±0.6μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 2, 'f')
-      66.7±0.4μs       33.7±0.1μs     0.50  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 4, 1, 'f')
-      82.3±0.1μs       51.0±0.2μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 4, 2, 'f')
-      88.8±0.9μs       66.2±0.3μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 4, 4, 'f')
-       119±0.5μs       80.4±0.5μs     0.68  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 1, 'd')
+      79.7±0.2μs        118±0.3μs     1.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 1, 'f')
-         118±2μs       83.8±0.3μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 2, 'd')
+      80.3±0.2μs        119±0.4μs     1.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 2, 'f')
-         125±3μs          108±2μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 4, 'd')
+        82.1±1μs          121±2μs     1.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 4, 'f')
-       120±0.7μs         81.8±1μs     0.68  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 1, 'd')
+      80.2±0.2μs        118±0.3μs     1.48  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 1, 'f')
-       118±0.2μs       87.9±0.3μs     0.74  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 2, 'd')
+      80.6±0.6μs        119±0.8μs     1.47  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 2, 'f')
+      83.7±0.2μs        118±0.2μs     1.41  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 4, 'f')
-         126±4μs         93.3±7μs     0.74  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 1, 'd')
+      81.4±0.7μs        120±0.3μs     1.47  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 1, 'f')
-         120±1μs        104±0.4μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 2, 'd')
+      80.9±0.2μs        118±0.1μs     1.46  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 2, 'f')
+        88.5±1μs          118±2μs     1.34  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 4, 4, 'f')
-       225±0.6μs        153±0.9μs     0.68  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 1, 'f')
-       277±0.8μs        194±0.4μs     0.70  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 2, 'f')
-         283±4μs          197±3μs     0.70  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 1, 4, 'f')
-       250±0.9μs        207±0.6μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 1, 'f')
-         306±2μs          241±2μs     0.79  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 2, 'f')
-       305±0.5μs        241±0.6μs     0.79  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 4, 'f')
-         291±1μs        207±0.6μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 1, 'f')
-       332±0.5μs          241±3μs     0.73  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 2, 'f')
-         333±5μs          241±4μs     0.72  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 4, 'f')
-      60.5±0.3μs       39.0±0.4μs     0.65  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 1, 2, 'f')
-      60.7±0.8μs         53.6±1μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 1, 4, 'f')
-      54.2±0.3μs       43.4±0.4μs     0.80  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 2, 2, 'f')
-      62.5±0.9μs       36.6±0.1μs     0.58  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 1, 'f')
-      83.0±0.2μs       52.8±0.1μs     0.64  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 2, 'f')
-        88.8±1μs       66.4±0.2μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 4, 'f')
-      25.1±0.2μs       21.9±0.3μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 2, 1, 'f')
-      47.7±0.3μs       43.1±0.5μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 2, 2, 'f')
-      64.9±0.3μs       33.7±0.1μs     0.52  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 4, 1, 'f')
-      82.5±0.1μs       52.0±0.1μs     0.63  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 4, 2, 'f')
-        88.9±1μs       66.6±0.4μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'square'>, 4, 4, 'f')
-         815±6μs          594±4μs     0.73  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 1, 'd')
-        854±10μs          615±6μs     0.72  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'd')
-       321±0.7μs          256±1μs     0.80  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 2, 'f')
-        896±20μs         644±20μs     0.72  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'd')
-         329±4μs          261±4μs     0.79  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 1, 4, 'f')
-         849±5μs          646±2μs     0.76  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 1, 'd')
-         883±8μs          676±1μs     0.77  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'd')
-         338±2μs          290±2μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 2, 'f')
-        895±30μs         678±20μs     0.76  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'd')
-       337±0.5μs        289±0.9μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 2, 4, 'f')
-         849±5μs          652±3μs     0.77  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'd')
-       302±0.6μs        248±0.5μs     0.82  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 1, 'f')
-         887±7μs          683±3μs     0.77  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'd')
-       352±0.7μs          291±2μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 2, 'f')
-         883±8μs          683±3μs     0.77  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'd')
-         353±5μs          292±4μs     0.83  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'tanh'>, 4, 4, 'f')
-      26.0±0.2μs       22.3±0.3μs     0.86  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 2, 1, 'f')
-      47.6±0.2μs       42.8±0.3μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 2, 2, 'f')
-      67.4±0.7μs       33.5±0.2μs     0.50  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 1, 'f')
-      82.3±0.2μs       51.0±0.1μs     0.62  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 2, 'f')
-        88.7±1μs       66.3±0.2μs     0.75  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 4, 'f')
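
A note on reading the table above: assuming it follows the usual `asv compare` layout (before, after, ratio, benchmark name), the ratio column is the "after" time divided by the "before" time. Rows marked `-` with a ratio below 1 are then speedups, and rows marked `+` with a ratio above 1 are slowdowns. A minimal Python sketch that recomputes the ratio for two rows copied from the table; the helper name `ratio` is hypothetical and not part of asv:

```python
def ratio(before_us: float, after_us: float) -> float:
    """Return after/before rounded to two decimals, like the ratio column."""
    return round(after_us / before_us, 2)

# Row: AVX_cmplx_arithmetic.time_ufunc('add', 1, 'F'), 7.30 us -> 9.34 us
slowdown = ratio(7.30, 9.34)

# Row: Unary.time_ufunc(<ufunc 'absolute'>, 4, 1, 'f'), 62.4 us -> 33.8 us
speedup = ratio(62.4, 33.8)

print(slowdown, speedup)
```

Under that assumption, the first row is roughly a 28% slowdown and the second roughly a 46% reduction in run time.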

@Mousius
Copy link
Member

Mousius commented May 18, 2023

I would suggest postponing your current work for 1-2 months until we are done with #2105, then refactoring this work together; rewriting the C SIMD kernel intrinsics in C++ wouldn't be that hard. What do you think?

@seiko2plus, we're now past the timeframe suggested here. Is there any way to unblock this PR? It'd be great to be able to leverage SVE as we migrate the other routines to universal intrinsics.

Labels
01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery
Projects
Status: Awaiting a code review

5 participants