ENH: float64 sin/cos using Numpy intrinsics #23399

Merged: 11 commits, Apr 25, 2023

Mousius commented Mar 15, 2023

This takes the [sin](https://github.com/ARM-software/optimized-routines/blob/master/math/v_sin.c) and [cos](https://github.com/ARM-software/optimized-routines/blob/master/math/v_cos.c) algorithms from Optimized Routines, which is under the MIT license, and converts them to NumPy intrinsics.

The routines are within the ULP bounds of other vectorised math routines (<4 ULP). They reduce performance in some special cases but improve the normal cases. Compared with the SVML implementation, these routines are more performant in the special cases, so we're safe to assume the performance is acceptable for AArch64 as well.
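
To make the "<4 ULP" figure concrete, here is a minimal sketch of how an ULP error can be measured for a float64 sine. It is not the check NumPy itself uses: `ordered_bits`, `ulp_distance` and `reference_sin` are hypothetical helpers, the reference is an exact-rational Taylor series that is only sensible for small |x|, and `math.sin` stands in for the routine under test (feed in values returned by `np.sin` to check a particular NumPy build).

```python
import math
import struct
from fractions import Fraction


def ordered_bits(x: float) -> int:
    """Map a finite float64 to an integer so that adjacent floats differ by 1."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # IEEE 754 doubles are sign-magnitude; mirror the negative half so the
    # integer ordering matches the numeric ordering (and -0.0 maps to 0).
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float64 steps between a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))


def reference_sin(x: float, terms: int = 40) -> float:
    """Taylor series in exact rationals, rounded to float64 once at the end.

    Assumption: |x| is small (a few units), so 40 terms are far more than
    float64 precision needs.
    """
    fx = Fraction(x)
    term = total = fx
    for k in range(1, terms):
        term = -term * fx * fx / ((2 * k) * (2 * k + 1))
        total += term
    return float(total)


if __name__ == "__main__":
    samples = [i / 64 for i in range(-256, 257)]  # -4.0 .. 4.0
    worst = max(ulp_distance(math.sin(x), reference_sin(x)) for x in samples)
    print(f"worst error over {len(samples)} samples: {worst} ULP")
```

Comparing bit patterns rather than absolute differences keeps the measure meaningful near zero, where a tiny absolute error can still be many ULPs.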

| performance ratio (lower is better) | benchmark |
| ---- | ---- |
| 1.8 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd') |
| 1.79 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 4, 'd') |
| 1.77 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd') |
| 1.74 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 2, 'd') |
| 1.74 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 4, 'd') |
| 1.72 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 1, 'd') |
| 1.6 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd') |
| 1.6 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 4, 'd') |
| 1.56 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd') |
| 1.42 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 2, 'd') |
| 1.41 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 4, 'd') |
| 1.37 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 1, 'd') |
| 1.26 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd') |
| 1.26 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 4, 'd') |
| 1.2 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd') |
| 1.18 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd') |
| 1.18 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 4, 'd') |
| 1.12 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd') |
| 0.65 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd') |
| 0.64 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 4, 'd') |
| 0.64 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 4, 'd') |
| 0.64 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 2, 'd') |
| 0.61 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 4, 'd') |
| 0.61 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd') |
| 0.6 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 1, 'd') |
| 0.6 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd') |
| 0.56 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd') |
| 0.52 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd') |
| 0.52 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 4, 'd') |
| 0.52 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 4, 'd') |
| 0.52 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 2, 'd') |
| 0.47 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 4, 'd') |
| 0.47 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd') |
| 0.46 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd') |
| 0.46 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 1, 'd') |
| 0.42 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd') |

Mousius force-pushed the optimized-routines-sincos branch 2 times, most recently from bf46979 to c1bc90e, on March 15, 2023 20:28
Co-authored-by: Pierre Blanchard <[email protected]>
Mousius force-pushed the optimized-routines-sincos branch from c1bc90e to 3a11c0b on March 15, 2023 21:05
* #op = cos, sin#
*/
NPY_NOINLINE npyv_f64
simd_@op@_scalar_f64(npyv_f64 x, npyv_f64 y, npyv_b64 cmp)
Member

This is more performant than allowing the compiler to inline the code?

Member Author

Yip, surprisingly so, I didn't actually want to include it but it helped a lot with the edge cases (iirc, -0.2 to the ratio).
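
For readers unfamiliar with the pattern under discussion, the sketch below models in plain Python what a helper with this signature appears to do: `x` holds the input lanes, `y` holds the results of the vector fast path, and `cmp` flags the lanes that need special handling, which are then recomputed one at a time with a scalar routine. This is an assumption inferred from the signature and from the "special cases" and "edge cases" wording in this thread, not a transcription of the C code; `scalar_fallback` and its `op` parameter are hypothetical names.

```python
import math
from typing import Callable, List, Sequence


def scalar_fallback(
    op: Callable[[float], float],
    x: Sequence[float],
    y: Sequence[float],
    cmp: Sequence[bool],
) -> List[float]:
    """Return y with every lane flagged in cmp recomputed as op(x[lane])."""
    out = list(y)  # work on a copy of the lanes rather than the "vector" itself
    for lane, needs_scalar in enumerate(cmp):
        if needs_scalar:
            out[lane] = op(x[lane])  # slow path, one lane at a time
    return out


if __name__ == "__main__":
    x = [0.5, 1e300]
    fast_path = [math.sin(0.5), 0.0]  # pretend the fast path gave up on lane 1
    flags = [False, True]
    print(scalar_fallback(math.sin, x, fast_path, flags))
```

If the real helper behaves like this model, its cost is paid only when at least one lane is flagged, which would be consistent with the slowdown showing up in the special-value benchmarks rather than the normal ones.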

mattip commented Mar 15, 2023

What CPU features were used when benchmarking? Were these the only changed benchmarks when running the ufunc benchmarks?

This relies less on compilers understanding how to create these operations.

Mousius commented Mar 16, 2023

@mattip, this is using ASIMD on AArch64, just this branch on top of main.

Command:

python runtests.py --bench-compare main --bench bench_ufunc_strides.Unary

Features:

[  0.00%] ··· Importing benchmark suite produced output:
[  0.00%] ···· NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM*

Full Results (I filtered these as they look entirely unrelated yet still changed):

       before           after         ratio
     [e82af031]       [a5e4a516]
     <main>           <sincos>
+     43.3±0.02μs      78.2±0.04μs     1.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd')
+     43.8±0.03μs       78.7±0.2μs     1.80  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 4, 'd')
+     43.2±0.06μs      76.5±0.03μs     1.77  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd')
+     44.0±0.08μs      77.0±0.02μs     1.75  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 2, 'd')
+     44.4±0.04μs      77.2±0.03μs     1.74  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 4, 'd')
+     44.0±0.06μs      75.7±0.04μs     1.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 1, 'd')
+     45.1±0.05μs      72.4±0.03μs     1.61  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
+     45.2±0.04μs      72.6±0.02μs     1.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 4, 'd')
+     45.1±0.05μs      70.2±0.02μs     1.56  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
+     47.0±0.05μs      66.8±0.03μs     1.42  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 2, 'd')
+     47.4±0.09μs      67.0±0.03μs     1.41  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 4, 'd')
+      153±0.08μs         215±20μs     1.41  bench_ufunc.UFunc.time_ufunc_types('fabs')
+     46.9±0.04μs      64.1±0.02μs     1.37  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 1, 'd')
+     46.9±0.04μs      59.2±0.02μs     1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
+     47.2±0.03μs      59.5±0.04μs     1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 4, 'd')
+     46.7±0.03μs      56.3±0.07μs     1.20  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')
+     47.2±0.07μs      55.8±0.03μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
+     47.4±0.08μs      55.8±0.04μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 4, 'd')
+      8.52±0.1μs       9.88±0.6μs     1.16  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 2, 'd')
+        4.12±0μs         4.68±0μs     1.14  bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'left_shift'>, 1, 1, 1, 'Q')
+     47.0±0.09μs      52.6±0.06μs     1.12  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
+     6.10±0.06μs       6.80±0.2μs     1.11  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 2, 4, 1, 'D')
+     5.75±0.09μs       6.30±0.2μs     1.10  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'absolute'>, 4, 2, 'd')
+      13.8±0.4μs       15.1±0.3μs     1.09  bench_ufunc_strides.BinaryInt.time_binary(<ufunc 'maximum'>, 1, 1, 1, 'I')
+     6.69±0.01μs       7.27±0.2μs     1.09  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 2, 4, 'd')
+     6.46±0.01μs       7.00±0.4μs     1.08  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in1(<ufunc 'fmax'>, 4, 1, 1, 'd')
+      6.53±0.1μs       7.06±0.1μs     1.08  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int32'>)
+     6.78±0.03μs      7.33±0.03μs     1.08  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 1, 4, 2, 'F')
+     6.91±0.02μs      7.46±0.02μs     1.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'square'>, 2, 4, 'd')
+     5.39±0.04μs      5.81±0.08μs     1.08  bench_ufunc_strides.BinaryFPSpecial.time_binary(<ufunc 'maximum'>, 4, 4, 1, 'f')
+      24.8±0.9μs       26.7±0.4μs     1.08  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'logical_not'>, 1, 1, 'q')
+     5.82±0.08μs      6.26±0.09μs     1.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'floor'>, 4, 2, 'd')
+      8.00±0.2μs       8.59±0.2μs     1.07  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 4, 'd')
+      8.06±0.3μs       8.64±0.1μs     1.07  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'absolute'>, 4, 4, 'd')
+      7.81±0.1μs       8.35±0.2μs     1.07  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in0(<ufunc 'fmax'>, 1, 1, 4, 'd')
+     9.37±0.07μs       10.0±0.1μs     1.07  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 4, 2, 4, 'F')
+     13.7±0.08μs       14.7±0.2μs     1.07  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 1, 4, 4, 'D')
+     13.9±0.08μs       14.9±0.1μs     1.07  bench_ufunc_strides.BinaryInt.time_binary(<ufunc 'maximum'>, 1, 1, 1, 'i')
+      6.52±0.2μs      6.96±0.04μs     1.07  bench_ufunc_strides.BinaryFP.time_binary_scalar_in1(<ufunc 'fmax'>, 4, 4, 1, 'd')
+      8.12±0.1μs       8.65±0.3μs     1.06  bench_ufunc_strides.BinaryFP.time_binary_scalar_in1(<ufunc 'minimum'>, 4, 4, 2, 'd')
+     7.85±0.05μs       8.35±0.1μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 1, 1, 2, 'D')
+      5.42±0.2μs      5.76±0.01μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sqrt'>, 1, 2, 'f')
+     9.34±0.09μs       9.92±0.1μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'add'>, 4, 2, 2, 'D')
+      109±0.06μs        116±0.4μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 2, 2, 'f')
+       110±0.2μs       116±0.05μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 1, 1, 'f')
+       110±0.1μs       117±0.09μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 1, 2, 'f')
+      8.13±0.2μs      8.63±0.04μs     1.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 4, 4, 'd')
+       109±0.2μs        116±0.2μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 2, 4, 'f')
+       109±0.2μs       116±0.05μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 2, 1, 'f')
+     9.89±0.04μs       10.5±0.2μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 4, 4, 1, 'D')
+     9.51±0.04μs      10.1±0.02μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 2, 4, 2, 'D')
+     6.51±0.01μs       6.89±0.2μs     1.06  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in1(<ufunc 'minimum'>, 4, 2, 1, 'd')
+      111±0.09μs        117±0.2μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 4, 4, 'f')
+     8.11±0.08μs       8.58±0.1μs     1.06  bench_ufunc_strides.BinaryFP.time_binary_scalar_in0(<ufunc 'fmin'>, 2, 4, 2, 'd')
+     7.86±0.05μs       8.32±0.1μs     1.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 4, 4, 'd')
+     8.53±0.02μs      9.02±0.04μs     1.06  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in0(<ufunc 'fmin'>, 1, 2, 4, 'd')
+      8.20±0.2μs      8.67±0.09μs     1.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 4, 'd')
+      9.82±0.3μs      10.4±0.02μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 2, 2, 1, 'D')
+     5.82±0.03μs      6.15±0.02μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 4, 'd')
+      8.10±0.2μs      8.56±0.01μs     1.06  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in1(<ufunc 'fmin'>, 4, 2, 2, 'd')
+     7.01±0.05μs      7.40±0.01μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 2, 1, 'F')
+     9.70±0.05μs      10.2±0.08μs     1.05  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 1, 4, 4, 'F')
+     37.0±0.03μs       39.0±0.1μs     1.05  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in0(<ufunc 'ldexp'>, 1, 4, 2, 'd')
+      8.15±0.1μs       8.58±0.1μs     1.05  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 4, 2, 2, 'F')
+      110±0.09μs       116±0.05μs     1.05  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 1, 4, 'f')
+     7.67±0.04μs       8.08±0.2μs     1.05  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in0(<ufunc 'maximum'>, 2, 1, 4, 'd')
+     12.9±0.02μs       13.6±0.3μs     1.05  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 4, 4, 1, 'D')
+      9.67±0.3μs       10.2±0.2μs     1.05  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 2, 2, 2, 'D')
+    10.00±0.09μs       10.5±0.4μs     1.05  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in0(<ufunc 'maximum'>, 1, 4, 4, 'd')
+     10.1±0.07μs       10.6±0.3μs     1.05  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in0(<ufunc 'fmin'>, 4, 4, 4, 'd')
+     8.39±0.06μs       8.83±0.1μs     1.05  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in1(<ufunc 'minimum'>, 2, 1, 4, 'd')
+     7.62±0.03μs      8.01±0.08μs     1.05  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 1, 2, 2, 'D')
+      8.21±0.1μs       8.64±0.2μs     1.05  bench_ufunc_strides.BinaryFP.time_binary_scalar_in0(<ufunc 'fmin'>, 1, 4, 2, 'd')
+     6.68±0.03μs       7.03±0.3μs     1.05  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rint'>, 2, 4, 'd')
+     37.4±0.02μs       39.3±0.3μs     1.05  bench_ufunc_strides.BinaryFP.time_binary_scalar_in0(<ufunc 'ldexp'>, 1, 1, 4, 'd')
+      11.9±0.1μs       12.5±0.3μs     1.05  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'multiply'>, 4, 4, 1, 'D')
+      37.5±0.4μs       39.5±0.2μs     1.05  bench_ufunc_strides.BinaryFP.time_binary_scalar_in0(<ufunc 'ldexp'>, 1, 1, 2, 'd')
+     8.49±0.05μs       8.92±0.3μs     1.05  bench_ufunc_strides.UnaryComplex.time_unary(<ufunc 'conjugate'>, 4, 1, 'D')
+      10.2±0.3μs      10.7±0.03μs     1.05  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in0(<ufunc 'fmax'>, 4, 4, 4, 'd')
+     6.43±0.02μs       6.75±0.3μs     1.05  bench_ufunc_strides.BinaryFP.time_binary_scalar_in1(<ufunc 'minimum'>, 4, 4, 1, 'd')
-     8.39±0.05μs       7.99±0.3μs     0.95  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'add'>, 1, 1, 2, 'D')
-      8.13±0.3μs      7.74±0.01μs     0.95  bench_ufunc_strides.UnaryComplex.time_unary(<ufunc 'square'>, 2, 1, 'D')
-      42.1±0.2μs      40.1±0.05μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log1p'>, 1, 2, 'd')
-     3.35±0.03μs      3.19±0.03μs     0.95  bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'left_shift'>, 1, 1, 1, 'B')
-     5.36±0.02μs      5.09±0.03μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 2, 'f')
-         986±3ns          936±2ns     0.95  bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'add'>, 1, 1, 1, 'B')
-     6.07±0.07μs      5.76±0.01μs     0.95  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'absolute'>, 4, 2, 'd')
-     5.28±0.03μs       5.01±0.2μs     0.95  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 1, 'e')
-     5.32±0.08μs      5.05±0.05μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 2, 'f')
-      9.29±0.2μs       8.81±0.2μs     0.95  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'multiply'>, 2, 1, 1, 'D')
-     5.37±0.03μs      5.09±0.08μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 4, 'f')
-     5.21±0.04μs      4.93±0.07μs     0.95  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 1, 'f')
-     8.15±0.07μs      7.71±0.02μs     0.95  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in0(<ufunc 'minimum'>, 2, 1, 4, 'd')
-      10.4±0.1μs       9.76±0.2μs     0.94  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in0(<ufunc 'maximum'>, 4, 4, 4, 'd')
-     7.71±0.04μs      7.26±0.01μs     0.94  bench_ufunc_strides.BinaryFPSpecial.time_binary(<ufunc 'maximum'>, 4, 1, 2, 'd')
-     5.24±0.02μs      4.93±0.03μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 1, 'f')
-      10.4±0.2μs       9.74±0.2μs     0.94  bench_ufunc_strides.BinaryFP.time_binary_scalar_in1(<ufunc 'maximum'>, 4, 2, 4, 'd')
-     5.27±0.02μs      4.95±0.02μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 1, 'f')
-     6.76±0.07μs      6.34±0.04μs     0.94  bench_ufunc_strides.BinaryFP.time_binary_scalar_in0(<ufunc 'minimum'>, 4, 4, 1, 'd')
-      14.5±0.4μs       13.6±0.1μs     0.94  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 1, 1, 4, 'D')
-      5.40±0.2μs      5.07±0.08μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 2, 2, 'f')
-     5.36±0.02μs      5.02±0.05μs     0.94  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 2, 4, 'f')
-     9.79±0.02μs      9.17±0.03μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 1, 'd')
-      6.24±0.2μs      5.82±0.04μs     0.93  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'add'>, 2, 2, 1, 'D')
-     5.48±0.07μs      5.11±0.04μs     0.93  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 2, 'f')
-      21.6±0.3μs       20.0±0.1μs     0.92  bench_ufunc.UFunc.time_ufunc_types('bitwise_and')
-     9.58±0.02μs       8.84±0.5μs     0.92  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 2, 'f')
-     9.81±0.01μs       9.04±0.3μs     0.92  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 2, 'f')
-      9.88±0.3μs      9.01±0.09μs     0.91  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 2, 'd')
-     3.39±0.01μs         3.09±0μs     0.91  bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'right_shift'>, 1, 1, 1, 'h')
-      9.64±0.2μs       8.77±0.4μs     0.91  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 1, 'f')
-      5.64±0.1μs      5.12±0.04μs     0.91  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 2, 2, 'f')
-         416±6μs          377±1μs     0.91  bench_ufunc.UFunc.time_ufunc_types('degrees')
-     9.82±0.05μs       8.89±0.1μs     0.90  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 1, 'f')
-      6.58±0.5μs      5.90±0.02μs     0.90  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 2, 1, 1, 'D')
-     20.4±0.05μs       18.1±0.4μs     0.89  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 4, 'e')
-     11.2±0.03μs      9.97±0.02μs     0.89  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 4, 'd')
-     11.1±0.07μs       9.88±0.2μs     0.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 4, 'd')
-     9.65±0.01μs      8.56±0.07μs     0.89  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 4, 'f')
-     10.9±0.09μs       9.65±0.1μs     0.88  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 4, 'd')
-        2.26±0μs         2.00±0μs     0.88  bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'bitwise_and'>, 1, 1, 1, 'i')
-      19.7±0.4μs       17.4±0.4μs     0.88  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 2, 'e')
-     19.3±0.01μs      17.0±0.02μs     0.88  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 4, 'e')
-      19.2±0.2μs       16.9±0.3μs     0.88  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 1, 'e')
-        4.69±0μs      4.12±0.01μs     0.88  bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'left_shift'>, 1, 1, 1, 'q')
-     11.2±0.08μs      9.81±0.04μs     0.88  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 4, 'd')
-     9.90±0.03μs      8.55±0.07μs     0.86  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 4, 'f')
-      5.89±0.2μs      5.04±0.09μs     0.86  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 2, 1, 'e')
-     21.2±0.01μs       18.1±0.4μs     0.85  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 2, 'e')
-     10.5±0.01μs       8.85±0.2μs     0.85  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 1, 'd')
-      21.0±0.1μs      17.7±0.01μs     0.85  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 1, 'e')
-     10.7±0.05μs       9.01±0.2μs     0.84  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 4, 'd')
-     10.9±0.01μs      9.19±0.09μs     0.84  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 2, 'd')
-     10.5±0.01μs      8.77±0.02μs     0.84  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 4, 'd')
-     10.4±0.01μs      8.70±0.05μs     0.83  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 2, 'd')
-     10.4±0.01μs      8.68±0.03μs     0.83  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 2, 'd')
-     10.5±0.01μs       8.70±0.2μs     0.83  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 1, 'd')
-     10.5±0.03μs      8.70±0.04μs     0.83  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 1, 'd')
-     10.5±0.09μs       8.72±0.2μs     0.83  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 1, 'd')
-     10.9±0.04μs      8.98±0.08μs     0.83  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 2, 'd')
-     10.5±0.03μs      8.64±0.01μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 2, 'd')
-     10.4±0.01μs      8.59±0.02μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 4, 'd')
-     10.8±0.04μs       8.92±0.2μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 4, 'd')
-     10.5±0.03μs      8.60±0.01μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 2, 'd')
-     10.8±0.02μs      8.85±0.07μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 4, 'd')
-        10.4±0μs      8.53±0.04μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 2, 'd')
-        10.4±0μs         8.53±0μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 2, 'd')
-      10.6±0.1μs      8.64±0.01μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 4, 'd')
-      11.0±0.2μs      8.98±0.06μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 2, 'd')
-     10.4±0.02μs      8.51±0.01μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 1, 'd')
-        10.4±0μs      8.50±0.01μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 1, 'd')
-     10.5±0.01μs         8.52±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'd')
-        10.4±0μs      8.51±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 1, 'd')
-        10.4±0μs         8.50±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 1, 'd')
-     10.4±0.01μs      8.51±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 1, 'd')
-     10.5±0.01μs         8.52±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 2, 'd')
-        10.4±0μs         8.48±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 1, 'd')
-     10.5±0.01μs         8.51±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 2, 'd')
-     10.5±0.01μs      8.49±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 1, 'd')
-      11.0±0.2μs      8.84±0.08μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 4, 'd')
-     10.6±0.02μs      8.56±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 4, 'd')
-      11.1±0.3μs       8.73±0.1μs     0.79  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 2, 'd')
-      19.8±0.4μs      15.3±0.01μs     0.77  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 1, 'e')
-      19.7±0.4μs      15.0±0.02μs     0.77  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 4, 'e')
-      19.7±0.4μs      15.0±0.03μs     0.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 2, 'e')
-     72.2±0.09μs       46.6±0.1μs     0.65  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 4, 'd')
-     71.8±0.03μs      46.2±0.02μs     0.64  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
-     71.9±0.08μs       46.2±0.1μs     0.64  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 4, 'd')
-      71.8±0.1μs      46.0±0.01μs     0.64  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 2, 'd')
-      72.0±0.2μs       43.9±0.1μs     0.61  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 4, 'd')
-     71.7±0.07μs      43.7±0.01μs     0.61  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
-     71.7±0.04μs      43.3±0.03μs     0.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
-     71.7±0.04μs      43.2±0.04μs     0.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 1, 'd')
-     71.7±0.06μs       40.4±0.1μs     0.56  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
-      85.1±0.1μs      44.4±0.02μs     0.52  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 4, 'd')
-      85.0±0.2μs       44.1±0.1μs     0.52  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
-     84.9±0.06μs      43.8±0.02μs     0.52  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 2, 'd')
-      85.2±0.1μs      43.9±0.02μs     0.52  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 4, 'd')
-      85.3±0.3μs      40.1±0.09μs     0.47  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 4, 'd')
-     85.2±0.04μs      39.9±0.01μs     0.47  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
-     84.8±0.04μs       39.7±0.7μs     0.47  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 1, 'd')
-     84.8±0.04μs      39.4±0.08μs     0.47  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
-     85.1±0.06μs      35.8±0.04μs     0.42  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
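
The `ratio` column is the `after` time divided by the `before` time, so values above 1 are slowdowns and values below 1 are speedups. As a quick sanity check, the sketch below parses rows in the layout shown above and recomputes that ratio. `parse_row`, `ROW` and `UNIT_SECONDS` are illustrative helpers written for this layout only; they are not part of the benchmark tooling.

```python
import re

# Seconds per unit for the suffixes that appear in the table. Both the Greek
# letter mu and the micro sign are accepted, since they look identical in print.
UNIT_SECONDS = {"ns": 1e-9, "μs": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0}

ROW = re.compile(
    r"^[+-]\s+"
    r"(?P<before>[\d.]+)±[\d.]+(?P<before_unit>ns|[μµ]s|ms|s)\s+"
    r"(?P<after>[\d.]+)±[\d.]+(?P<after_unit>ns|[μµ]s|ms|s)\s+"
    r"(?P<ratio>[\d.]+)\s+"
    r"(?P<name>.+)$"
)


def parse_row(row: str):
    """Return (benchmark name, before seconds, after seconds, after/before)."""
    match = ROW.match(row.strip())
    if match is None:
        raise ValueError(f"unrecognised row: {row!r}")
    before = float(match["before"]) * UNIT_SECONDS[match["before_unit"]]
    after = float(match["after"]) * UNIT_SECONDS[match["after_unit"]]
    return match["name"], before, after, after / before


if __name__ == "__main__":
    rows = [
        "+     43.3±0.02μs      78.2±0.04μs     1.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd')",
        "-     85.1±0.06μs      35.8±0.04μs     0.42  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')",
    ]
    for row in rows:
        name, before, after, ratio = parse_row(row)
        print(f"{ratio:.2f}  {name}")
```

Recomputing the ratio from the two timing columns also makes it easy to re-sort or filter the rows, for example to keep only the `sin` and `cos` entries.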

Hopefully this works better in the 32-bit builds
Mousius force-pushed the optimized-routines-sincos branch from 9a5eb01 to 8fae1b4 on March 16, 2023 12:55

Mousius commented Mar 16, 2023

@mattip I'm a bit stumped by the 32-bit windows failures, are there any guides on how to reproduce them easily?

mattip commented Mar 16, 2023

The 32-bit windows failures are strange. @seiko2plus could you take a look?

mattip commented Mar 16, 2023

> Comparing to the SVML implementation, these routines are more performant in special cases

Did you check benchmarks on a machine that uses SVML?

@seiko2plus

> I'm a bit stumped by the 32-bit windows failures, are there any guides on how to reproduce them easily?

> The 32-bit windows failures are strange.

From a quick look, I guess the problem is due to not force inlining the following function

NPY_NOINLINE npyv_f64
simd_@op@_scalar_f64(npyv_f64 x, npyv_f64 y, npyv_b64 cmp)
{
// MSVC doesn't compile with direct vector access, so we copy it here

This is mainly a stack-alignment issue: spilling a register wider than 128 bits onto a stack that is only 128-bit aligned could cause these strange errors.
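
Judging by its signature and the comment about copying, the quoted function appears to be a per-lane scalar fallback: the vector is copied into a plain array, and only the lanes flagged by `cmp` are recomputed with the scalar routine. A rough sketch of that idea in Python rather than NumPy's C intrinsics. The name `scalar_fallback`, the list-based lanes and the integer bitmask are all assumptions made for illustration, not the PR's code:

```python
import math


def scalar_fallback(x_lanes, y_lanes, cmp_bits, op=math.sin):
    """Return y_lanes with every lane flagged in cmp_bits replaced by op(x).

    x_lanes  : inputs, one per SIMD lane (hypothetical stand-in for npyv_f64)
    y_lanes  : results already produced by the vector path
    cmp_bits : integer bitmask, bit i set means lane i needs the scalar path
    """
    # Copy the "vector" into an ordinary buffer before touching single lanes.
    out = list(y_lanes)
    for i, x in enumerate(x_lanes):
        if (cmp_bits >> i) & 1:
            out[i] = op(x)
    return out


# Lane 1 holds a very large input, so only that lane is recomputed.
print(scalar_fallback([0.5, 1e22], [0.25, 0.0], cmp_bits=0b10))
```

Note that Python's `math.sin` raises `ValueError` for infinite inputs, so this sketch does not model how the C routine treats `inf` or `nan`.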

This is to test the theory that noinline is causing stack spilling that
collides with alignment in weird ways.
@Mousius Mousius force-pushed the optimized-routines-sincos branch from b4fb0ff to ca33625 Compare March 20, 2023 09:18
@Mousius
Member Author

Mousius commented Mar 20, 2023

> > Comparing to the SVML implementation, these routines are more performant in special cases

> Did you check benchmarks on a machine that uses SVML?

Yip, it was one of the newer cloud instances.

@Mousius
Member Author

Mousius commented Mar 20, 2023

> > I'm a bit stumped by the 32-bit windows failures, are there any guides on how to reproduce them easily?

> > The 32-bit windows failures are strange.

> From a quick look, I guess the problem is due to not force inlining the following function
>
> NPY_NOINLINE npyv_f64
> simd_@op@_scalar_f64(npyv_f64 x, npyv_f64 y, npyv_b64 cmp)
> {
> // MSVC doesn't compile with direct vector access, so we copy it here
>
> Mainly because of stack alignment, spilling a register with width larger than 128-bit over 128-bit alignment could cause these strange errors.

Ahhh, I see other references to similar behaviour elsewhere. I've added some logic to mitigate it, thanks for the guidance!

@Mousius Mousius force-pushed the optimized-routines-sincos branch from b5f7d21 to f5fda54 Compare March 21, 2023 09:41
@mattip
Member

mattip commented Mar 21, 2023

> > Did you check benchmarks on a machine that uses SVML?

> Yip, it was one of the newer cloud instances.

Could you add the results to the PR, including the output of np.show_runtime()?

@seiko2plus could you review?

@Mousius
Member Author

Mousius commented Mar 22, 2023

> > > Did you check benchmarks on a machine that uses SVML?

> > Yip, it was one of the newer cloud instances.

> Could you add the results to the PR, including the output of np.show_runtime()?

> @seiko2plus could you review?

>>> np.show_runtime()
WARNING: `threadpoolctl` not found in system! Install it by `pip install threadpoolctl`. Once installed, try `np.show_runtime` again for more detailed build information
[{'numpy_version': '1.25.0.dev0+942.g3785c1937',
  'python': '3.9.5 (default, Nov 23 2021, 15:27:38) \n[GCC 9.3.0]',
  'uname': uname_result(system='Linux', node='', release='5.15.0-1031-aws', version='#35~20.04.1-Ubuntu SMP Sat Feb 11 16:19:06 UTC 2023', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2',
                                'AVX512F',
                                'AVX512CD',
                                'AVX512_SKX',
                                'AVX512_CLX',
                                'AVX512_CNL',
                                'AVX512_ICL'],
                      'not_found': ['AVX512_KNL', 'AVX512_KNM']}}]
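
To confirm from this output that the machine can take the SVML path, the entry to look for is `AVX512_SKX` among the detected extensions. That `AVX512_SKX` is what gates SVML is stated here as an assumption about NumPy's dispatch, not something shown in this thread. A small sketch that checks a hand-transcribed, abridged copy of the structure above. `np.show_runtime()` prints its report rather than returning it, so `runtime_info` and `simd_found` are illustrative names:

```python
# Abridged, hand-copied from the printed report above.
runtime_info = [
    {"numpy_version": "1.25.0.dev0+942.g3785c1937"},
    {
        "simd_extensions": {
            "baseline": ["SSE", "SSE2", "SSE3"],
            "found": [
                "SSSE3", "SSE41", "POPCNT", "SSE42", "AVX", "F16C", "FMA3",
                "AVX2", "AVX512F", "AVX512CD", "AVX512_SKX", "AVX512_CLX",
                "AVX512_CNL", "AVX512_ICL",
            ],
            "not_found": ["AVX512_KNL", "AVX512_KNM"],
        }
    },
]


def simd_found(info):
    """Return every SIMD extension reported as baseline or found."""
    for entry in info:
        if "simd_extensions" in entry:
            ext = entry["simd_extensions"]
            return set(ext["baseline"]) | set(ext["found"])
    return set()


# Assumption: AVX512_SKX is the feature that enables the SVML code path.
print("AVX512_SKX" in simd_found(runtime_info))
```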
       before           after         ratio
     [b35aac2c]       [3785c193]
     <main>           <sincos>
+        3.94±0μs         6.33±0μs     1.61  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 2, 1, 'e')
+     3.94±0.01μs      6.33±0.01μs     1.61  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 1, 'e')
+     3.94±0.01μs         6.33±0μs     1.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 1, 'e')
+     3.94±0.01μs      6.32±0.01μs     1.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 1, 'e')
+     3.94±0.01μs      6.33±0.01μs     1.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 2, 1, 'e')
+     3.95±0.01μs      6.32±0.01μs     1.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 1, 'e')
+     3.98±0.01μs         6.36±0μs     1.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 2, 'e')
+        3.98±0μs         6.36±0μs     1.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 2, 'e')
+     3.98±0.01μs         6.36±0μs     1.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 2, 2, 'e')
+     3.98±0.01μs         6.36±0μs     1.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 2, 'e')
+        3.98±0μs         6.36±0μs     1.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 2, 'e')
+        3.99±0μs         6.36±0μs     1.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 2, 2, 'e')
+        4.00±0μs         6.37±0μs     1.59  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 2, 4, 'e')
+        4.00±0μs         6.37±0μs     1.59  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 4, 'e')
+     4.00±0.01μs         6.37±0μs     1.59  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 4, 'e')
+        4.01±0μs         6.38±0μs     1.59  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 2, 4, 'e')
+        4.01±0μs         6.38±0μs     1.59  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 4, 'e')
+        4.01±0μs         6.38±0μs     1.59  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 4, 'e')
+        4.97±0μs         7.81±0μs     1.57  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'positive'>, 1, 1, 'H')
+        2.80±0μs         4.21±0μs     1.50  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'positive'>, 1, 1, 'b')
+        4.59±0μs         6.36±0μs     1.39  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 1, 'e')
+        4.59±0μs         6.37±0μs     1.39  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 1, 1, 'e')
+     4.60±0.01μs         6.36±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 1, 'e')
+     4.60±0.01μs         6.36±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 1, 1, 'e')
+     4.61±0.01μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 4, 1, 'e')
+     4.61±0.01μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 4, 1, 'e')
+        4.61±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 4, 1, 'e')
+        4.60±0μs         6.36±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 1, 1, 'e')
+        4.61±0μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 2, 1, 'e')
+        4.61±0μs      6.38±0.01μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 2, 'e')
+     4.60±0.01μs         6.36±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 1, 'e')
+        4.61±0μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 1, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 4, 2, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 2, 2, 'e')
+     4.61±0.01μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 4, 1, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 2, 'e')
+        4.61±0μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 1, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 2, 1, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 4, 2, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 2, 4, 'e')
+        4.61±0μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 2, 1, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 2, 4, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 4, 2, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 4, 2, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 2, 2, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 4, 4, 'e')
+        4.62±0μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 2, 1, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 4, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 4, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 2, 2, 'e')
+     4.62±0.01μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 2, 1, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 1, 4, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 2, 4, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 2, 'e')
+     4.62±0.01μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 2, 1, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 4, 4, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 2, 2, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 2, 4, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 2, 4, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 2, 2, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 4, 'e')
+        4.62±0μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 1, 2, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 1, 2, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 2, 4, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 4, 4, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 1, 4, 'e')
+        4.62±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 4, 'e')
+     4.62±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 2, 'e')
+        4.62±0μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 2, 2, 'e')
+     4.62±0.01μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 2, 'e')
+        4.63±0μs      6.38±0.01μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 4, 'e')
+     4.63±0.01μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 1, 4, 'e')
+     4.62±0.01μs         6.37±0μs     1.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 1, 2, 'e')
+        4.63±0μs         6.38±0μs     1.38  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 4, 4, 'e')
+     8.39±0.03μs      11.3±0.05μs     1.35  bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'right_shift'>, 1, 1, 1, 'q')
+     8.32±0.01μs      11.2±0.04μs     1.35  bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'right_shift'>, 1, 1, 1, 'l')
+     18.5±0.06μs       24.2±0.2μs     1.31  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 4, 'e')
+     18.4±0.04μs       23.9±0.1μs     1.30  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 2, 'e')
+     6.72±0.03μs         8.71±2μs     1.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sinh'>, 1, 2, 'f')
+     18.5±0.08μs       23.9±0.3μs     1.29  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 4, 'e')
+     13.4±0.03μs       17.2±0.2μs     1.28  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp'>, 1, 1, 'e')
+      18.9±0.3μs       23.9±0.2μs     1.27  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 1, 'e')
+      18.6±0.1μs       23.4±0.3μs     1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 1, 'e')
+     18.5±0.05μs       23.1±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 2, 'e')
+     18.0±0.02μs      22.2±0.07μs     1.23  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 4, 'e')
+     12.2±0.01μs      15.0±0.01μs     1.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 1, 1, 'e')
+     12.2±0.01μs         15.1±0μs     1.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 4, 1, 'e')
+     18.1±0.05μs      22.2±0.07μs     1.23  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 2, 'e')
+     18.4±0.01μs      22.6±0.01μs     1.23  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 1, 1, 'f')
+     12.2±0.01μs      15.1±0.01μs     1.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 2, 1, 'e')
+     12.2±0.01μs      15.0±0.01μs     1.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 1, 4, 'e')
+     12.2±0.01μs         15.0±0μs     1.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 4, 4, 'e')
+     12.3±0.02μs      15.0±0.01μs     1.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 4, 2, 'e')
+        12.3±0μs      15.0±0.01μs     1.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 2, 2, 'e')
+     12.3±0.01μs      15.0±0.01μs     1.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 2, 4, 'e')
+      20.4±0.1μs       24.9±0.2μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 1, 4, 'f')
+     12.3±0.02μs      15.0±0.01μs     1.22  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 1, 2, 'e')
+     18.5±0.01μs      22.6±0.01μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 1, 1, 'e')
+     18.0±0.03μs       21.9±0.1μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 1, 'e')
+     20.2±0.06μs       24.5±0.3μs     1.21  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 1, 2, 'f')
+      17.3±0.4μs       20.9±0.5μs     1.20  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 2, 4, 2, 'D')
+     20.4±0.01μs         24.3±3μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 1, 1, 'd')
+     14.3±0.02μs      17.0±0.05μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 1, 1, 'e')
+      12.1±0.4μs       14.3±0.9μs     1.18  bench_ufunc_strides.UnaryComplex.time_unary(<ufunc 'conjugate'>, 2, 4, 'D')
+     15.3±0.01μs      17.9±0.02μs     1.17  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 1, 1, 'e')
+      22.0±0.5μs       25.6±0.8μs     1.17  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 4, 4, 4, 'D')
+        4.20±0μs      4.89±0.02μs     1.16  bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'right_shift'>, 1, 1, 1, 'Q')
+      15.0±0.3μs       17.4±0.4μs     1.16  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 2, 1, 4, 'D')
+     24.1±0.03μs      27.8±0.01μs     1.16  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 1, 'e')
+     15.3±0.01μs         17.7±0μs     1.15  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 1, 1, 'f')
+      9.42±0.3μs       10.8±0.8μs     1.15  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'multiply'>, 2, 2, 2, 'D')
+     7.59±0.01μs       8.73±0.9μs     1.15  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 2, 4, 1, 'D')
+     15.6±0.06μs       17.9±0.1μs     1.15  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 1, 2, 'f')
+        4.11±0μs      4.72±0.05μs     1.15  bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'right_shift'>, 1, 1, 1, 'L')
+     33.4±0.02μs       38.3±0.3μs     1.15  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'multiply'>, 4, 4, 2, 'D')
+      16.6±0.1μs      19.0±0.08μs     1.15  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 4, 2, 'f')
+     15.7±0.07μs      18.0±0.04μs     1.14  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 1, 4, 'f')
+     15.0±0.01μs      17.2±0.03μs     1.14  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 2, 4, 'd')
+     15.0±0.01μs      17.1±0.04μs     1.14  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 4, 'd')
+     15.0±0.01μs      17.1±0.02μs     1.14  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 4, 'd')
+     15.0±0.01μs      17.1±0.03μs     1.14  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 2, 4, 'd')
+     15.0±0.02μs      17.1±0.01μs     1.14  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 4, 'd')
+     15.0±0.01μs      17.1±0.01μs     1.14  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 4, 'd')
+     15.0±0.01μs      17.1±0.02μs     1.14  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 4, 'd')
+     15.0±0.01μs      17.1±0.01μs     1.14  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 4, 'd')
+      23.0±0.7μs       26.2±0.3μs     1.14  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 2, 4, 4, 'D')
+     21.9±0.04μs       25.0±0.4μs     1.14  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 2, 4, 'f')
+     15.0±0.01μs      17.1±0.01μs     1.14  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 2, 4, 'd')
+     15.0±0.01μs      17.1±0.02μs     1.14  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 2, 4, 'd')
+     22.0±0.08μs      25.0±0.05μs     1.14  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 4, 2, 'f')
+        37.6±1μs       42.7±0.2μs     1.14  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'positive'>, 1, 1, 'Q')
+     16.6±0.02μs      18.9±0.05μs     1.14  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 2, 4, 'f')
+     15.7±0.01μs      17.8±0.06μs     1.14  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 4, 1, 'f')
+     47.6±0.09μs         53.8±2μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 4, 4, 'e')
+     21.9±0.04μs       24.8±0.3μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 4, 4, 'f')
+      11.1±0.4μs       12.5±0.5μs     1.13  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 2, 4, 2, 'D')
+      14.0±0.3μs      15.8±0.03μs     1.13  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 2, 4, 'D')
+      47.6±0.2μs         53.8±2μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 4, 2, 'e')
+     47.6±0.03μs         53.8±2μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 2, 'e')
+     47.6±0.08μs         53.7±2μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 2, 4, 'e')
+        5.66±0μs         6.39±0μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 1, 1, 'e')
+      47.6±0.2μs         53.8±1μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 2, 2, 'e')
+     15.3±0.02μs      17.3±0.03μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 4, 1, 'f')
+     16.7±0.04μs       18.9±0.1μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 2, 2, 'f')
+        5.66±0μs      6.39±0.01μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 1, 1, 'e')
+     47.7±0.07μs         53.8±2μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 4, 1, 'e')
+      47.6±0.1μs         53.7±1μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 1, 'e')
+        5.68±0μs         6.40±0μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 2, 1, 'e')
+        5.68±0μs         6.41±0μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 2, 2, 'e')
+     5.68±0.01μs         6.41±0μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 4, 1, 'e')
+     24.9±0.01μs      28.1±0.06μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccosh'>, 1, 1, 'e')
+        5.68±0μs         6.40±0μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 2, 1, 'e')
+     15.3±0.01μs      17.2±0.09μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 2, 1, 'f')
+     5.69±0.01μs         6.40±0μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 4, 1, 'e')
+      47.7±0.1μs         53.7±1μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 1, 4, 'e')
+     23.5±0.06μs         26.5±2μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 1, 2, 'd')
+        5.69±0μs         6.40±0μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 1, 2, 'e')
+     22.0±0.05μs       24.8±0.4μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 2, 2, 'f')
+     5.69±0.01μs         6.40±0μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 1, 2, 'e')
+        5.69±0μs         6.40±0μs     1.13  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 2, 2, 'e')
+        5.69±0μs         6.40±0μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 1, 4, 'e')
+      47.8±0.1μs         53.7±1μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'trunc'>, 2, 1, 'e')
+      24.7±0.2μs      27.8±0.07μs     1.12  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccosh'>, 1, 1, 'f')
+      11.9±0.3μs       13.3±0.2μs     1.12  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 2, 1, 'D')
+        5.70±0μs         6.40±0μs     1.12  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 2, 4, 'e')
+        5.70±0μs         6.41±0μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 2, 4, 'e')
+        5.70±0μs         6.40±0μs     1.12  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 1, 4, 'e')
+     5.70±0.01μs      6.40±0.01μs     1.12  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 4, 2, 'e')
+        5.71±0μs         6.41±0μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 4, 4, 'e')
+     50.4±0.07μs         56.5±4μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 2, 'e')
+        5.72±0μs      6.41±0.01μs     1.12  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 4, 4, 'e')
+     5.71±0.01μs         6.40±0μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 4, 2, 'e')
+      50.9±0.6μs       57.0±0.8μs     1.12  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'divide'>, 4, 4, 4, 'D')
+     17.0±0.08μs      19.0±0.01μs     1.12  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 4, 4, 'f')
+     5.87±0.01μs       6.57±0.6μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctanh'>, 1, 1, 'f')
+      21.1±0.7μs       23.6±0.5μs     1.12  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 2, 2, 4, 'D')
+     17.4±0.01μs      19.4±0.08μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 4, 'd')
+      50.3±0.2μs         56.2±3μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 1, 'e')
+      50.6±0.3μs         56.6±3μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'e')
+     17.3±0.03μs      19.3±0.02μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 4, 4, 'd')
+      50.7±0.3μs         56.6±3μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 2, 'e')
+     16.6±0.06μs      18.6±0.04μs     1.12  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 1, 2, 'f')
+      50.6±0.2μs         56.4±4μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 4, 'e')
+        50.9±1μs       56.8±0.4μs     1.11  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 4, 4, 'D')
+     15.8±0.02μs      17.6±0.02μs     1.11  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 2, 1, 'f')
+      19.9±0.8μs       22.2±0.4μs     1.11  bench_ufunc_strides.UnaryComplex.time_unary(<ufunc 'square'>, 4, 4, 'D')
+      16.2±0.1μs       18.0±0.1μs     1.11  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 2, 4, 'f')
+     17.4±0.03μs      19.3±0.02μs     1.11  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 4, 4, 'd')
+     17.4±0.01μs      19.3±0.01μs     1.11  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 4, 'd')
+      16.6±0.1μs      18.5±0.07μs     1.11  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 1, 4, 'f')
+      50.7±0.3μs         56.4±3μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 2, 'e')
+      48.8±0.1μs         54.3±4μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 1, 2, 'e')
+     48.3±0.08μs         53.7±4μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 4, 1, 'e')
+      25.0±0.2μs         27.8±2μs     1.11  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 1, 4, 'd')
+     48.5±0.09μs         53.8±4μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 4, 4, 'e')
+     17.4±0.01μs      19.3±0.01μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 4, 'd')
+     17.4±0.02μs      19.3±0.01μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 4, 'd')
+     17.4±0.01μs      19.3±0.01μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 4, 'd')
+     17.4±0.01μs      19.3±0.01μs     1.11  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 4, 'd')
+      50.6±0.2μs         56.2±3μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 4, 'e')
+     17.4±0.02μs      19.3±0.01μs     1.11  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 4, 'd')
+      51.8±0.2μs         57.4±4μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 1, 'e')
+      49.0±0.3μs         54.3±4μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 2, 2, 'e')
+      50.7±0.4μs         56.2±3μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 4, 'e')
+      49.0±0.1μs         54.3±4μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 1, 1, 'e')
+     16.3±0.03μs      18.0±0.07μs     1.11  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 4, 2, 'f')
+      49.0±0.2μs         54.3±4μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 1, 4, 'e')
+      51.9±0.1μs         57.4±3μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 4, 'e')
+      48.5±0.1μs         53.7±4μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 4, 2, 'e')
+      50.8±0.3μs         56.3±3μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'e')
+      16.3±0.1μs       18.0±0.2μs     1.11  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 2, 2, 'f')
+      51.1±0.3μs         56.4±3μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 4, 'e')
+      51.8±0.3μs         57.2±3μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 4, 'e')
+        35.4±1μs       39.1±0.9μs     1.10  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 2, 4, 4, 'D')
+     17.5±0.06μs      19.3±0.02μs     1.10  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 4, 'd')
+      49.2±0.2μs         54.3±4μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 2, 1, 'e')
+      49.3±0.1μs         54.3±4μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 2, 4, 'e')
+        27.5±1μs       30.3±0.8μs     1.10  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 4, 1, 4, 'D')
+     14.1±0.01μs      15.5±0.03μs     1.10  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'expm1'>, 1, 1, 'f')
+      36.8±0.1μs         40.5±3μs     1.10  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'positive'>, 1, 1, 'L')
+      35.2±0.6μs       38.7±0.1μs     1.10  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 2, 4, 'D')
+      36.9±0.3μs         40.5±2μs     1.10  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'conjugate'>, 1, 1, 'q')
+      51.7±0.2μs         56.8±3μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 1, 'e')
+      21.8±0.1μs         23.9±1μs     1.10  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 4, 1, 4, 'D')
+     7.41±0.04μs       8.13±0.4μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctanh'>, 1, 4, 'f')
+      52.1±0.3μs         57.1±3μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 2, 'e')
+      50.8±0.3μs         55.7±3μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 2, 'e')
+      50.7±0.1μs         55.5±3μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 1, 'e')
+      50.9±0.1μs       55.7±0.8μs     1.10  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'multiply'>, 4, 4, 4, 'D')
+      36.6±0.9μs         40.0±2μs     1.09  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'positive'>, 1, 1, 'q')
+      36.8±0.5μs         40.2±2μs     1.09  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'absolute'>, 1, 1, 'Q')
+      18.8±0.3μs       20.5±0.6μs     1.09  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'multiply'>, 4, 2, 2, 'D')
+      18.6±0.8μs       20.3±0.5μs     1.09  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 2, 2, 'D')
+     3.84±0.02μs      4.19±0.01μs     1.09  bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 1, 1, 'L')
+      51.8±0.1μs         56.5±3μs     1.09  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 2, 'e')
+     6.31±0.01μs         6.89±0μs     1.09  bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'right_shift'>, 1, 1, 1, 'q')
+      13.7±0.4μs       15.0±0.3μs     1.09  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'add'>, 4, 2, 4, 'D')
+      38.5±0.3μs      41.8±0.05μs     1.09  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'divide'>, 4, 4, 2, 'D')
+      15.2±0.1μs       16.5±0.8μs     1.09  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 1, 2, 4, 'D')
+      13.2±0.6μs       14.4±0.1μs     1.09  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 4, 1, 2, 'D')
+     49.6±0.03μs         53.8±2μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 4, 1, 'e')
+      39.0±0.7μs         42.3±1μs     1.08  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'square'>, 1, 1, 'l')
+     49.6±0.03μs         53.8±1μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 2, 'e')
+     49.6±0.02μs         53.8±2μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 4, 2, 'e')
+      13.1±0.3μs       14.2±0.3μs     1.08  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 4, 2, 4, 'D')
+     49.6±0.02μs         53.8±2μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 4, 4, 'e')
+      27.5±0.1μs      29.8±0.09μs     1.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccosh'>, 1, 2, 'f')
+     49.6±0.01μs         53.7±1μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 2, 4, 'e')
+     49.6±0.02μs         53.7±1μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 2, 2, 'e')
+     49.6±0.02μs         53.7±1μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 4, 'e')
+      11.0±0.3μs       11.9±0.3μs     1.08  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'multiply'>, 4, 2, 2, 'D')
+     49.6±0.04μs         53.7±2μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 1, 'e')
+     14.3±0.02μs      15.5±0.04μs     1.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'expm1'>, 1, 1, 'e')
+     49.6±0.02μs         53.7±1μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 2, 1, 'e')
+       221±0.2μs        239±0.3μs     1.08  bench_ufunc.UFunc.time_ufunc_types('sign')
+      16.7±0.2μs      18.0±0.02μs     1.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 4, 4, 'f')
+      21.2±0.2μs         22.8±1μs     1.08  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 2, 2, 4, 'D')
+        21.6±0μs       23.3±0.3μs     1.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 2, 1, 'f')
+       255±0.2μs          275±1μs     1.08  bench_ufunc.UFunc.time_ufunc_types('logical_xor')
+     14.3±0.01μs       15.4±0.2μs     1.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 1, 1, 'f')
+      37.4±0.3μs         40.2±1μs     1.07  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'absolute'>, 1, 1, 'l')
+      38.4±0.2μs         41.2±1μs     1.07  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'square'>, 1, 1, 'L')
+     3.85±0.01μs      4.13±0.06μs     1.07  bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 1, 1, 'q')
+      22.7±0.8μs       24.3±0.3μs     1.07  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'multiply'>, 4, 2, 4, 'D')
+      36.5±0.5μs       39.1±0.3μs     1.07  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 4, 2, 4, 'D')
+      16.1±0.3μs      17.3±0.01μs     1.07  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'multiply'>, 2, 1, 4, 'D')
+      27.7±0.2μs      29.7±0.05μs     1.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccosh'>, 1, 4, 'f')
+      39.0±0.1μs         41.8±2μs     1.07  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'square'>, 1, 1, 'q')
+     34.2±0.05μs         36.6±1μs     1.07  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 4, 2, 'D')
+     9.80±0.01μs       10.5±0.4μs     1.07  bench_ufunc_strides.UnaryComplex.time_unary(<ufunc 'conjugate'>, 4, 2, 'D')
+      37.3±0.2μs         39.8±1μs     1.07  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'conjugate'>, 1, 1, 'L')
+      13.9±0.5μs      14.8±0.04μs     1.07  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'multiply'>, 2, 4, 4, 'D')
+         845±3ns         900±20ns     1.07  bench_ufunc.UFuncSmall.time_ufunc_small_int_array('sqrt')
+         320±1μs          341±1μs     1.07  bench_ufunc.UFunc.time_ufunc_types('minimum')
+     7.66±0.06μs       8.15±0.4μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 2, 2, 'D')
+     17.0±0.04μs      18.1±0.03μs     1.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log10'>, 2, 4, 'f')
+      37.0±0.3μs       39.4±0.8μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'divide'>, 4, 2, 4, 'D')
+         890±3ns          945±3ns     1.06  bench_ufunc.UFuncSmall.time_ufunc_small_int_array('cos')
+     21.6±0.01μs      22.9±0.07μs     1.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 4, 1, 'f')
+      36.3±0.3μs       38.5±0.6μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 2, 4, 4, 'D')
+     12.3±0.07μs      13.0±0.06μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 1, 2, 4, 'D')
+     3.85±0.03μs      4.07±0.06μs     1.06  bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 1, 1, 'Q')
+     7.83±0.04μs       8.29±0.2μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 2, 1, 2, 'D')
+      78.0±0.9μs       82.5±0.9μs     1.06  bench_ufunc_strides.BinaryInt.time_binary_scalar_in0(<ufunc 'maximum'>, 2, 2, 1, 'L')
+     47.7±0.09μs         50.5±1μs     1.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'reciprocal'>, 1, 4, 'e')
+     3.84±0.02μs      4.06±0.03μs     1.06  bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 1, 1, 'l')
+     38.8±0.08μs         41.0±1μs     1.06  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'sign'>, 1, 1, 'l')
+      23.1±0.7μs       24.4±0.3μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 4, 4, 'D')
+     6.86±0.01μs      7.24±0.08μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'absolute'>, 4, 4, 'd')
+      47.8±0.2μs         50.5±1μs     1.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'reciprocal'>, 4, 1, 'e')
+      36.4±0.3μs       38.5±0.4μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'multiply'>, 4, 2, 4, 'D')
+         493±3μs        520±0.4μs     1.05  bench_ufunc.UFunc.time_ufunc_types('true_divide')
+      36.8±0.3μs       38.8±0.5μs     1.05  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'multiply'>, 2, 4, 4, 'D')
+      20.9±0.1ms      22.0±0.07ms     1.05  bench_ufunc.At.time_maximum_at
+      47.9±0.1μs         50.5±1μs     1.05  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'reciprocal'>, 4, 2, 'e')
+      51.6±0.4μs       54.3±0.7μs     1.05  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 4, 4, 4, 'D')
+     17.0±0.04μs      17.9±0.08μs     1.05  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log10'>, 2, 2, 'f')
+      48.0±0.2μs         50.5±1μs     1.05  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'reciprocal'>, 1, 1, 'e')
+       215±0.3μs          226±1μs     1.05  bench_ufunc.UFunc.time_ufunc_types('equal')
+     7.49±0.02μs      7.86±0.06μs     1.05  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'subtract'>, 4, 2, 1, 'D')
-      5.07±0.2μs         4.81±0μs     0.95  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log1p'>, 1, 1, 'f')
-      9.43±0.1μs       8.95±0.1μs     0.95  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccosh'>, 1, 2, 'f')
-        20.3±1μs      19.2±0.03μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 1, 1, 'd')
-      11.0±0.3μs       10.4±0.1μs     0.95  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 4, 4, 2, 'D')
-      30.3±0.2μs       28.7±0.2μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 1, 2, 'd')
-     20.0±0.07μs       19.0±0.1μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arcsin'>, 2, 2, 'f')
-      8.90±0.3μs      8.43±0.04μs     0.95  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f')
-      20.2±0.2μs      19.1±0.01μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arcsin'>, 2, 4, 'f')
-        21.6±0μs      20.4±0.04μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 4, 4, 'f')
-       189±0.3μs          178±3μs     0.94  bench_ufunc.UFunc.time_ufunc_types('radians')
-      7.94±0.3μs      7.50±0.01μs     0.94  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 4, 1, 1, 'D')
-        73.6±3μs       69.5±0.2μs     0.94  bench_ufunc.UFunc.time_ufunc_types('ldexp')
-      25.6±0.1μs       24.1±0.2μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arcsinh'>, 1, 4, 'f')
-         935±6ns          882±3ns     0.94  bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'subtract'>, 1, 1, 1, 'B')
-      10.4±0.4μs      9.77±0.09μs     0.94  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in1(<ufunc 'maximum'>, 4, 2, 4, 'd')
-      9.21±0.3μs      8.69±0.02μs     0.94  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 2, 1, 2, 'D')
-     20.1±0.01μs      18.9±0.07μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 2, 1, 'f')
-        724±30ns          681±2ns     0.94  bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.uint32'>, 100)
-        718±30ns          675±2ns     0.94  bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.uint8'>, 100)
-      13.1±0.7μs      12.3±0.03μs     0.94  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 4, 'f')
-      25.9±0.3μs      24.3±0.02μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccos'>, 4, 2, 'f')
-      21.7±0.1μs       20.4±0.2μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 4, 2, 'f')
-      10.6±0.4μs      9.96±0.07μs     0.94  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 2, 'f')
-        780±40ns          732±5ns     0.94  bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 100)
-        848±30ns        795±0.8ns     0.94  bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.uint64'>, 100)
-     27.8±0.01μs       26.1±0.4μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 1, 1, 'd')
-        24.5±1μs      22.9±0.02μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp'>, 1, 4, 'd')
-      25.9±0.2μs      24.3±0.01μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccos'>, 2, 4, 'f')
-     26.0±0.06μs      24.3±0.01μs     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arcsinh'>, 1, 2, 'f')
-      23.9±0.3μs       22.3±0.3μs     0.93  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 1, 1, 'd')
-       189±0.3μs          177±2μs     0.93  bench_ufunc.UFunc.time_ufunc_types('deg2rad')
-     21.7±0.03μs      20.2±0.03μs     0.93  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 2, 4, 'f')
-     21.7±0.04μs      20.2±0.03μs     0.93  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 2, 2, 'f')
-      8.02±0.4μs      7.45±0.04μs     0.93  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 2, 4, 1, 'D')
-       190±0.1μs          177±3μs     0.93  bench_ufunc.UFunc.time_ufunc_types('rad2deg')
-     14.5±0.04μs       13.4±0.9μs     0.93  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'multiply'>, 4, 1, 2, 'D')
-       188±0.7μs          175±4μs     0.93  bench_ufunc.UFunc.time_ufunc_types('fabs')
-        740±30ns          686±5ns     0.93  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), subok=True, where=True))
-      16.7±0.1μs      15.5±0.03μs     0.93  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log'>, 1, 1, 'e')
-      10.5±0.7μs      9.76±0.03μs     0.93  bench_ufunc_strides.BinaryFPSpecial.time_binary(<ufunc 'maximum'>, 4, 4, 2, 'd')
-     8.32±0.07μs      7.70±0.07μs     0.93  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 4, 4, 1, 'D')
-       189±0.6μs          175±4μs     0.92  bench_ufunc.UFunc.time_ufunc_types('degrees')
-     21.7±0.02μs      20.0±0.03μs     0.92  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 1, 2, 'f')
-        701±30ns          645±2ns     0.92  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), subok=True))
-      23.4±0.3μs      21.5±0.02μs     0.92  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arcsinh'>, 1, 1, 'f')
-        762±30ns          701±2ns     0.92  bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 100)
-        521±40ns        479±0.6ns     0.92  bench_ufunc.UFuncSmall.time_ufunc_small_array('abs')
-     19.9±0.01μs      18.3±0.07μs     0.92  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 1, 1, 'f')
-     10.8±0.08μs       9.87±0.1μs     0.92  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 2, 2, 2, 'D')
-      8.39±0.5μs       7.69±0.2μs     0.92  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 4, 4, 1, 'D')
-        604±20ns          554±2ns     0.92  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.)))
-      22.0±0.1μs      20.1±0.09μs     0.92  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 1, 4, 'f')
-      4.63±0.3μs      4.22±0.02μs     0.91  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arcsin'>, 1, 1, 'f')
-     20.1±0.01μs      18.3±0.02μs     0.91  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 1, 1, 'e')
-      24.9±0.1μs      22.6±0.02μs     0.91  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccos'>, 2, 1, 'f')
-     24.9±0.04μs      22.6±0.02μs     0.91  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccos'>, 4, 1, 'f')
-      8.74±0.7μs      7.95±0.01μs     0.91  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'exp'>, 1, 1, 'd')
-        512±40ns        462±0.8ns     0.90  bench_ufunc.UFuncSmall.time_ufunc_small_int_array('abs')
-     7.38±0.06μs      6.67±0.08μs     0.90  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccos'>, 1, 2, 'f')
-      8.34±0.7μs      7.50±0.01μs     0.90  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'multiply'>, 4, 2, 1, 'D')
-        26.2±2μs       23.5±0.2μs     0.90  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'expm1'>, 1, 2, 'd')
-     13.5±0.02μs       12.1±0.2μs     0.89  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cbrt'>, 1, 2, 'd')
-     15.0±0.01μs      13.4±0.01μs     0.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 2, 'd')
-        38.9±4μs      34.8±0.03μs     0.89  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 4, 1, 'd')
-        15.0±0μs      13.4±0.01μs     0.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 2, 'd')
-        22.2±1μs      19.8±0.07μs     0.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccosh'>, 1, 2, 'd')
-     15.0±0.01μs      13.4±0.01μs     0.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 2, 'd')
-        15.0±0μs         13.4±0μs     0.89  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 2, 'd')
-        15.0±0μs         13.4±0μs     0.89  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 2, 'd')
-     15.0±0.01μs      13.4±0.02μs     0.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 2, 'd')
-     15.0±0.01μs      13.4±0.01μs     0.89  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 4, 2, 'd')
-     15.0±0.01μs      13.4±0.01μs     0.89  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 2, 'd')
-     15.0±0.01μs      13.4±0.01μs     0.89  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 2, 'd')
-     15.0±0.01μs      13.4±0.01μs     0.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 4, 2, 'd')
-      12.5±0.1μs       11.0±0.9μs     0.88  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'add'>, 1, 4, 2, 'D')
-      3.52±0.4μs      3.08±0.01μs     0.88  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log10'>, 1, 1, 'f')
-      9.75±0.6μs      8.52±0.09μs     0.87  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 1, 1, 'D')
-      13.3±0.9μs       11.2±0.3μs     0.85  bench_ufunc_strides.BinaryFP.time_binary(<ufunc 'maximum'>, 4, 4, 4, 'd')
-     9.04±0.01μs      7.56±0.01μs     0.84  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
-        6.42±0μs      5.29±0.01μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 4, 4, 'f')
-        6.42±0μs         5.29±0μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 4, 'f')
-        6.42±0μs         5.29±0μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 4, 4, 'f')
-        6.42±0μs         5.29±0μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 4, 4, 'f')
-        6.42±0μs         5.29±0μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 4, 'f')
-     6.43±0.01μs         5.29±0μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 4, 4, 'f')
-     6.67±0.02μs      5.49±0.01μs     0.82  bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'right_shift'>, 1, 1, 1, 'l')
-        15.0±0μs      12.3±0.02μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 1, 'd')
-        15.0±0μs      12.3±0.01μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 1, 'd')
-        15.0±0μs      12.3±0.01μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 2, 'f')
-     15.0±0.01μs         12.3±0μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 2, 1, 'd')
-        15.0±0μs      12.3±0.03μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 1, 'd')
-     15.0±0.01μs      12.3±0.01μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 1, 'd')
-     15.0±0.01μs      12.3±0.04μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 2, 'f')
-     15.0±0.01μs      12.3±0.02μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 1, 'd')
-     15.0±0.01μs         12.3±0μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 2, 'f')
-     15.0±0.01μs      12.3±0.03μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 2, 1, 'd')
-     15.0±0.01μs      12.2±0.02μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 2, 2, 'f')
-        15.0±0μs      12.2±0.01μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 2, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 1, 'd')
-     15.0±0.01μs         12.3±0μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 4, 1, 'd')
-     15.0±0.01μs      12.2±0.01μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 2, 'f')
-     15.0±0.01μs      12.2±0.02μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 2, 'f')
-     15.0±0.01μs      12.2±0.04μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 1, 'd')
-     15.0±0.01μs      12.2±0.03μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 1, 'd')
-     15.0±0.01μs      12.2±0.02μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 1, 'd')
-     15.0±0.01μs      12.2±0.02μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 2, 1, 'd')
-     15.0±0.01μs      12.2±0.02μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 4, 1, 'd')
-        15.0±0μs      12.2±0.01μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 2, 'f')
-        15.0±0μs      12.2±0.05μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 1, 'f')
-     15.0±0.01μs      12.2±0.02μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 2, 'f')
-     15.0±0.01μs      12.2±0.02μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 2, 'f')
-        15.0±0μs      12.2±0.02μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 1, 'd')
-        15.0±0μs      12.2±0.02μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 2, 'f')
-     15.0±0.01μs         12.2±0μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 2, 1, 'd')
-        15.0±0μs      12.2±0.03μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 1, 'd')
-     15.0±0.01μs      12.2±0.04μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 2, 'f')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 2, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 1, 'f')
-     15.0±0.01μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 2, 'f')
-        15.0±0μs      12.2±0.03μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 2, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 4, 2, 'f')
-        15.0±0μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 2, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 1, 'd')
-     15.0±0.01μs      12.2±0.03μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 1, 'd')
-     15.0±0.01μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 1, 'd')
-     15.0±0.01μs      12.2±0.04μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 4, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 1, 'd')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 1, 'd')
-        15.0±0μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'd')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 1, 'd')
-     15.0±0.02μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 2, 2, 'd')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 2, 'f')
-     15.0±0.01μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 2, 'f')
-     15.0±0.01μs      12.2±0.03μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 2, 'f')
-        15.0±0μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 2, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 1, 'f')
-        15.0±0μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 2, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 1, 'd')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 4, 2, 'f')
-        15.0±0μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 4, 4, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 2, 2, 'f')
-     15.0±0.01μs      12.2±0.03μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 1, 'f')
-        15.0±0μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 1, 2, 'f')
-     15.0±0.01μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 1, 'f')
-        15.0±0μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 1, 1, 'f')
-     15.0±0.01μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 1, 'f')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 2, 'd')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'f')
-        15.0±0μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 2, 4, 'f')
-        15.0±0μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 2, 2, 'f')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 2, 'f')
-     15.0±0.01μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 1, 'f')
-        15.0±0μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 2, 2, 'f')
-     15.0±0.02μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'd')
-     15.0±0.02μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 1, 'f')
-     15.0±0.01μs      12.2±0.04μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 4, 4, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 2, 1, 'f')
-     15.0±0.03μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 1, 'f')
-        15.0±0μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 4, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 1, 'd')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 1, 'd')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 2, 'd')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 1, 'f')
-        15.0±0μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 2, 'd')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 1, 4, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 2, 'd')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'd')
-     15.0±0.01μs      12.2±0.04μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 2, 'd')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 4, 'f')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 2, 2, 'd')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 2, 'f')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 2, 'f')
-     15.0±0.01μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 2, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 2, 'd')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 1, 1, 'd')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 4, 'f')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 2, 2, 'd')
-     15.0±0.01μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 2, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 4, 'f')
-        15.0±0μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 2, 4, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 2, 1, 'f')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 2, 'd')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 2, 2, 'd')
-        15.0±0μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 4, 'f')
-     15.0±0.01μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 2, 'd')
-     15.0±0.01μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 2, 'd')
-     15.0±0.02μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 4, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 4, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 4, 'f')
-     15.0±0.02μs      12.2±0.03μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 1, 'f')
-     15.0±0.01μs      12.2±0.03μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 1, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 4, 'f')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 4, 'f')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 4, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 4, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 2, 'd')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 2, 'd')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 4, 'f')
-        15.0±0μs      12.1±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 4, 'f')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 2, 'd')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 4, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 1, 'f')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 1, 2, 'd')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 4, 'f')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 2, 4, 'f')
-        15.0±0μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 2, 'd')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 2, 'd')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 2, 4, 'f')
-     15.0±0.01μs      12.2±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 4, 'f')
-     15.0±0.01μs      12.1±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 4, 'f')
-     15.0±0.01μs      12.2±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 4, 'f')
-        15.0±0μs         12.1±0μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 4, 'f')
-        15.0±0μs      12.1±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 4, 'f')
-     15.0±0.01μs      12.1±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 2, 'd')
-        15.0±0μs      12.1±0.01μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 4, 'f')
-     15.0±0.01μs      12.1±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 4, 'f')
-     15.0±0.01μs      12.1±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 4, 'f')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 1, 'f')
-     15.0±0.01μs         12.2±0μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'f')
-     12.1±0.01μs      9.27±0.01μs     0.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 2, 1, 'e')
-     12.1±0.01μs      9.26±0.01μs     0.76  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 2, 1, 'e')
-     12.1±0.01μs         9.26±0μs     0.76  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 2, 2, 'e')
-     12.1±0.01μs         9.26±0μs     0.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 4, 1, 'e')
-     12.1±0.01μs      9.27±0.01μs     0.76  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 4, 'e')
-     12.1±0.01μs         9.26±0μs     0.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 2, 2, 'e')
-     12.1±0.01μs      9.23±0.01μs     0.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 1, 1, 'e')
-        12.1±0μs         9.27±0μs     0.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 4, 4, 'e')
-        12.1±0μs      9.26±0.01μs     0.76  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 2, 'e')
-     12.1±0.01μs      9.26±0.01μs     0.76  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 2, 4, 'e')
-        12.1±0μs         9.26±0μs     0.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 1, 4, 'e')
-        12.1±0μs      9.25±0.01μs     0.76  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 4, 1, 'e')
-        12.1±0μs         9.22±0μs     0.76  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 1, 1, 'e')
-     12.1±0.01μs      9.25±0.01μs     0.76  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 1, 2, 'e')
-     12.1±0.01μs         9.26±0μs     0.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 4, 2, 'e')
-        12.1±0μs      9.25±0.01μs     0.76  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 1, 4, 'e')
-     12.1±0.01μs         9.24±0μs     0.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 1, 2, 'e')
-     12.1±0.01μs         9.25±0μs     0.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 2, 4, 'e')
-      9.05±0.6μs      6.73±0.02μs     0.74  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
-      12.1±0.1μs       8.96±0.1μs     0.74  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
-      11.2±0.4μs      8.19±0.01μs     0.73  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
-        6.39±0μs      4.62±0.01μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 2, 4, 'f')
-     6.40±0.01μs         4.62±0μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 2, 4, 'f')
-     6.39±0.01μs      4.61±0.01μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 2, 4, 'f')
-        6.40±0μs      4.61±0.01μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 2, 4, 'f')
-        6.40±0μs      4.61±0.02μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 2, 4, 'f')
-        6.40±0μs         4.61±0μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 2, 4, 'f')
-     6.39±0.01μs      4.60±0.01μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 4, 'f')
-        6.39±0μs         4.60±0μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 4, 'f')
-        6.38±0μs         4.59±0μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 2, 1, 'f')
-        6.38±0μs      4.59±0.01μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 1, 2, 'f')
-        6.40±0μs      4.60±0.01μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 1, 4, 'f')
-        6.39±0μs      4.59±0.01μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 2, 'f')
-        6.38±0μs      4.59±0.01μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 1, 'f')
-        6.39±0μs         4.59±0μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 1, 4, 'f')
-        6.38±0μs         4.59±0μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 1, 1, 'f')
-        6.39±0μs      4.59±0.01μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 1, 2, 'f')
-     6.39±0.01μs         4.59±0μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 2, 'f')
-        6.39±0μs         4.59±0μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 1, 4, 'f')
-        6.39±0μs         4.59±0μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 2, 'f')
-        6.38±0μs         4.58±0μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 1, 'f')
-        6.39±0μs         4.59±0μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 2, 1, 'f')
-     6.39±0.01μs         4.59±0μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 4, 'f')
-     6.39±0.01μs      4.59±0.01μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 2, 2, 'f')
-        6.39±0μs         4.58±0μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 1, 'f')
-        6.39±0μs         4.58±0μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 1, 2, 'f')
-        6.38±0μs         4.58±0μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 1, 1, 'f')
-        6.39±0μs      4.58±0.01μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 2, 1, 'f')
-        6.39±0μs      4.59±0.01μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 2, 2, 'f')
-        6.39±0μs      4.58±0.01μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 2, 1, 'f')
-        6.39±0μs         4.58±0μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 1, 1, 'f')
-     6.39±0.01μs      4.58±0.01μs     0.72  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 2, 2, 'f')
-        6.40±0μs      4.58±0.01μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 2, 2, 'f')
-     6.39±0.01μs      4.58±0.01μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 2, 1, 'f')
-        6.39±0μs         4.58±0μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 2, 2, 'f')
-        6.39±0μs         4.58±0μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 2, 2, 'f')
-        6.39±0μs      4.57±0.01μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 2, 1, 'f')
-        6.42±0μs      4.59±0.01μs     0.72  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 4, 2, 'f')
-     6.41±0.01μs      4.58±0.01μs     0.71  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 4, 2, 'f')
-        6.41±0μs         4.57±0μs     0.71  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 4, 2, 'f')
-        6.41±0μs         4.57±0μs     0.71  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 2, 'f')
-        6.41±0μs         4.57±0μs     0.71  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 4, 1, 'f')
-        6.41±0μs         4.57±0μs     0.71  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 4, 2, 'f')
-        6.41±0μs         4.57±0μs     0.71  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 4, 1, 'f')
-        6.42±0μs         4.57±0μs     0.71  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 2, 'f')
-        6.41±0μs         4.57±0μs     0.71  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 4, 1, 'f')
-        6.41±0μs         4.57±0μs     0.71  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 4, 1, 'f')
-        6.41±0μs         4.56±0μs     0.71  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (1), 4, 1, 'f')
-        6.41±0μs         4.57±0μs     0.71  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 4, 1, 'f')
-      35.5±0.3μs       23.8±0.2μs     0.67  bench_ufunc.UFunc.time_ufunc_types('right_shift')
-      13.8±0.4μs      9.14±0.09μs     0.66  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 4, 'd')
-      13.1±0.1μs      8.64±0.06μs     0.66  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 4, 'd')
-        6.39±0μs      3.86±0.01μs     0.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 2, 4, 'f')
-        6.37±0μs      3.84±0.02μs     0.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 2, 1, 'f')
-        6.37±0μs      3.83±0.01μs     0.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 2, 1, 'f')
-        6.40±0μs      3.84±0.01μs     0.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 4, 'f')
-        6.37±0μs      3.83±0.01μs     0.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 1, 'f')
-        6.37±0μs      3.83±0.01μs     0.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 1, 'f')
-     6.39±0.01μs      3.84±0.01μs     0.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 2, 2, 'f')
-        6.40±0μs      3.84±0.01μs     0.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 4, 'f')
-        6.39±0μs      3.84±0.01μs     0.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 4, 'f')
-        6.37±0μs      3.82±0.01μs     0.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 1, 'f')
-        6.39±0μs      3.83±0.01μs     0.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 4, 'f')
-        6.40±0μs      3.83±0.02μs     0.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 2, 4, 'f')
-     6.39±0.01μs      3.83±0.01μs     0.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 2, 2, 'f')
-        6.39±0μs      3.82±0.01μs     0.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 2, 'f')
-     6.36±0.01μs      3.81±0.02μs     0.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 1, 'f')
-        6.39±0μs      3.83±0.01μs     0.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 2, 'f')
-     6.39±0.01μs      3.82±0.01μs     0.60  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 2, 'f')
-        6.39±0μs      3.81±0.01μs     0.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 2, 'f')
-     82.1±0.09μs       48.8±0.4μs     0.59  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 4, 'd')
-      79.3±0.3μs       46.7±0.7μs     0.59  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
-      81.6±0.3μs         48.0±1μs     0.59  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
-     81.7±0.04μs         45.4±1μs     0.56  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
-     19.9±0.02μs      11.0±0.04μs     0.55  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 2, 'd')
-      84.6±0.3μs       46.6±0.8μs     0.55  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
-      85.9±0.4μs       47.0±0.8μs     0.55  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 4, 'd')
-      20.6±0.1μs      11.1±0.01μs     0.54  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 2, 'd')
-     19.8±0.07μs      9.83±0.02μs     0.50  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 4, 'd')
-      20.6±0.1μs      10.2±0.02μs     0.49  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 4, 'd')
-     20.7±0.09μs      10.1±0.07μs     0.49  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
-     20.6±0.09μs      9.97±0.01μs     0.48  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 4, 'd')
-     19.9±0.02μs      9.52±0.04μs     0.48  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
-     19.9±0.08μs      9.31±0.02μs     0.47  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 4, 'd')
-     19.8±0.01μs       8.55±0.1μs     0.43  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
-     20.5±0.01μs      8.80±0.06μs     0.43  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
-     20.4±0.01μs      8.64±0.01μs     0.42  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 1, 'd')
-     19.7±0.01μs       8.34±0.1μs     0.42  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 1, 'd')
-       120±0.1μs       47.9±0.1μs     0.40  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd')
-      121±0.09μs       47.9±0.2μs     0.40  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 4, 'd')
-       121±0.2μs       47.7±0.1μs     0.39  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 4, 'd')
-       120±0.2μs       47.4±0.1μs     0.39  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 2, 'd')
-      119±0.05μs       46.4±0.3μs     0.39  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 1, 'd')
-      120±0.06μs       46.4±0.1μs     0.39  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd')
-      123±0.04μs       47.2±0.7μs     0.39  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
-      122±0.03μs         46.8±1μs     0.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 2, 'd')
-      123±0.08μs         47.0±1μs     0.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 4, 'd')
-       123±0.2μs         46.9±1μs     0.38  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 4, 'd')
-      122±0.05μs         45.0±2μs     0.37  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 1, 'd')
-      122±0.03μs         45.0±2μs     0.37  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')

@mhvk
Contributor

mhvk commented Mar 22, 2023

Bit from the sidelines, but it seems that non-trig functions for which this PR should be irrelevant become quite a bit slower (like np.positive and np.conjugate, where the operation itself costs very little time). Would be good to understand why that is, because it seems avoidable -- if it is something generic about how ufuncs are evaluated that can be solved, then the improvement for the trig functions may be even larger than it appears here.

Member

@seiko2plus seiko2plus left a comment

LGTM, just minor changes.

Comment on lines 191 to 196
if (npyv_any_b64(cmp)) {
/* If fenv exceptions are to be triggered correctly, set any special lanes
to 1 (which is neutral w.r.t. fenv). These lanes will be fixed by
scalar loop later. */
r = npyv_select_f64(cmp, npyv_setall_f64(1.0), r);
}
Member

Suggested change
if (npyv_any_b64(cmp)) {
/* If fenv exceptions are to be triggered correctly, set any special lanes
to 1 (which is neutral w.r.t. fenv). These lanes will be fixed by
scalar loop later. */
r = npyv_select_f64(cmp, npyv_setall_f64(1.0), r);
}
/* If fenv exceptions are to be triggered correctly, set any special lanes
to 1 (which is neutral w.r.t. fenv). These lanes will be fixed by
scalar loop later. */
r = npyv_select_f64(cmp, npyv_setall_f64(1.0), r);

Typo? This branch costs way more than a single blend instruction.

Member Author

Ahh, this comes from https://github.com/ARM-software/optimized-routines/blob/master/math/v_sin.c#L69, where originally it was way more optimised for the happy path - will update!

@@ -2,19 +2,25 @@
** $maxopt baseline
** (avx2 fma3) avx512f
** vsx2 vsx3 vsx4
** neon_vfpv4
** asimd
Member

Suggested change
** asimd
** neon_vfpv4

neon_vfpv4 already implies asimd on aarch64 since it's part of the hardware baseline so no need for this change.

}

if (npyv_any_b64(cmp)) {
out = simd_@op@_scalar_f64(x_in, out, cmp);
Member

Suggested change
out = simd_@op@_scalar_f64(x_in, out, cmp);
out = npyv_select_f64(cmp, x_in, out);
out = simd_@op@_scalar_f64(out, npyv_tobits_b64(cmp));

You can pass only one vector instead, which is supposed to perform better.

Member Author

What architecture is that for? On AArch64 function calls can use the vector registers directly (Ref) so it'll end up the same overhead as far as I know.

Member

Agreed, but I meant using one vector to avoid the extra element access. Passing a bitfield instead of the boolean vector should also save at least one op due to the nature of any/all and tobits, e.g. on x86 (SSE/AVX) they use movemask.
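
A minimal Python sketch of the idea in this exchange, with SIMD vectors modelled as plain lists. The helpers `select`, `to_bits`, `scalar_fixup` and `finish` are hypothetical stand-ins for `npyv_select_f64`, `npyv_tobits_b64` and the scalar fallback loop; this is an illustration under those assumptions, not the code from the PR.

```python
import math

def select(mask, a, b):
    """Lane-wise blend: take a[i] where mask[i] is set, otherwise b[i]."""
    return [ai if m else bi for m, ai, bi in zip(mask, a, b)]

def to_bits(mask):
    """Pack boolean lanes into an integer bitfield, lane 0 in bit 0."""
    bits = 0
    for i, m in enumerate(mask):
        if m:
            bits |= 1 << i
    return bits

def scalar_fixup(lanes, bits, func=math.sin):
    """Recompute only the lanes flagged in `bits` with the scalar function."""
    return [func(v) if (bits >> i) & 1 else v for i, v in enumerate(lanes)]

def finish(x_in, out, mask):
    """Merge inputs into the special lanes, then fix those lanes one by one."""
    merged = select(mask, x_in, out)   # a single vector now carries both
    bits = to_bits(mask)               # one integer instead of a mask vector
    return scalar_fixup(merged, bits) if bits else merged
```

With `mask = [False, True]`, only lane 1 is recomputed by the scalar function, and lane 0 keeps the value produced by the vector path.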

@seiko2plus
Member

Did you check benchmarks on a machine that uses SVML?

This implementation is very close to SVML. For long args, SVML uses Payne-Hanek style reduction, while this PR falls back to libm.

@seberg
Member

seberg commented Apr 25, 2023

@Mousius forgot, but if you would like it, you can follow up with adding a brief towncrier fragment for the performance enhancement.

Mousius added a commit to Mousius/numpy that referenced this pull request Apr 25, 2023
This builds on top of numpy#23399 and introduces a NumPy intrinsic variant for [float64 tan](https://github.com/ARM-software/optimized-routines/blob/master/pl/math/v_tan_3u5.c), taken from Optimized Routines under MIT license.

CPU features:
```
NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM
```

Relevant benchmarks:
```
     <main>           <optimized-routines-tan-f64>
+     82.1±0.03μs        103±0.2μs     1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'd')
+     81.6±0.03μs        102±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'd')
+     82.4±0.08μs        103±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'd')
+      81.9±0.2μs        102±0.3μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'd')
+     81.5±0.03μs        101±0.4μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'd')
+     81.6±0.04μs       99.9±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'd')
+     82.8±0.05μs        101±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'd')
+     82.4±0.04μs        100±0.3μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'd')
+      82.6±0.3μs       97.2±0.3μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'd')
-       200±0.1μs       63.9±0.1μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'd')
-       199±0.1μs      63.4±0.07μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'd')
-       201±0.6μs      63.4±0.04μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'd')
-       200±0.1μs      63.2±0.03μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'd')
-       200±0.1μs      59.4±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'd')
-       201±0.2μs      59.3±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'd')
-       200±0.2μs      57.7±0.03μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'd')
-       200±0.3μs      57.6±0.02μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'd')
-       200±0.1μs      53.7±0.01μs     0.27  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')
```
@seberg
Member

seberg commented Apr 28, 2023

Just a note. It is probably fine, but some downstream packages are noticing the difference in precision in their test suite (which surprised me slightly, because I thought it would be the same as SVML and presumably that had been used in the test suite before).

@Mousius
Member Author

Mousius commented Apr 28, 2023

Just a note. It is probably fine, but some downstream packages are noticing the difference in precision in their test suite (which surprised me slightly, because I thought it would be the same as SVML and presumably that had been used in the test suite before).

Thanks for the heads up @seberg, I also believe this is fine. @PierreBlanchard noted, in reference to the numpy test suite, that you won't necessarily hit the maximum error case even across a large number of samples: #23171 (comment)

Having said that, I'll keep an eye on the ones that are linking back to this PR, are there others I should be aware of?

@mdhaber
Contributor

mdhaber commented Apr 28, 2023

are there others I should be aware of?

I suppose it's linked above already, but in scipy/scipy#18382, np.cos(np.pi) != -1.

@mhvk
Contributor

mhvk commented Apr 29, 2023

Just to add that I also found it a bit surprising that np.cos(np.pi) != -1 exactly, especially since np.mod(np.pi, np.pi/2) is correctly equal to exactly 0, so if this is done in the first quadrant and then mirrored one would expect it. But np.sin(np.pi/2) is also not exactly 1, so perhaps that is it. (Also, even on previous numpy, np.sin(np.pi) is not exactly 0.)

@WarrenWeckesser
Member

np.sin(np.pi) is not exactly 0.

You're probably aware of this, but that is a correct result. np.pi is not exactly $\pi$. In fact np.pi $\approx \pi -$ 1.2246467991473532e-16, so the result

In [48]: np.sin(np.pi)
Out[48]: 1.2246467991473532e-16

is the correct double precision result.

Even taking the inexactness of np.pi into account, the correct double precision result of np.cos(np.pi) is -1. In fact, the double precision result of $\cos(\pi + \delta)$ should be -1 for $|\delta| < \sqrt{\varepsilon/2} \approx 1.05367\textrm{e-}8$, where $\varepsilon$ is the double precision "machine epsilon" (approx. 2.22e-16).

What we're seeing now is that np.cos(np.pi) = -0.9999999999999998, which differs from -1 by 2 ULPs. According to @Mousius, this is within the stated acceptable error bounds: "The routines are within the ULP boundaries of other vectorised math routines (<4ULP)."
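
The "2 ULPs" figure can be checked with a short standard-library sketch. The helper names `ordered_bits` and `ulp_distance` are made up for this illustration; the observed value is the one quoted above, and the threshold is the $\sqrt{\varepsilon/2}$ bound from the previous paragraph.

```python
import math
import struct
import sys

def ordered_bits(x):
    """Map a double to an integer so that adjacent doubles differ by exactly 1."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Raw bit patterns of negative doubles sort in reverse; mirror them.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def ulp_distance(a, b):
    """Number of representable doubles between a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))

observed = -0.9999999999999998          # value reported for np.cos(np.pi)
distance = ulp_distance(observed, -1.0)

# Largest |delta| for which cos(pi + delta) should still round to -1.
threshold = math.sqrt(sys.float_info.epsilon / 2)
```

Here `distance` counts steps between neighbouring doubles, which is the spacing just below 1 in magnitude (about 1.11e-16 per step).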

I think we're being caught off guard because the errors, while within the stated bounds, are occurring in a "surprising" location, where we haven't seen them before (namely around $\pm\pi$, where in a relatively large interval, the correct result is -1).

Although I'm familiar with all the usual precautions about floating point calculations, I'm still guilty of writing code that expected np.cos(np.pi) to be -1. (At least a unit test caught the change, and made explicit the change in behavior of np.cos.) It would be nice if the library could ensure that np.cos(np.pi) returned -1, but I don't know if implementing such a guarantee would hurt performance significantly. If this is the state of the art for vectorized trig functions, then we'll have to adapt our code (in my case, SciPy code) to allow for these errors.

@Mousius
Member Author

Mousius commented May 2, 2023

What we're seeing now is that np.cos(np.pi) = -0.9999999999999998, which differs from -1 by 2 ULPs. According to @Mousius, this is within the stated acceptable error bounds: "The routines are within the ULP boundaries of other vectorised math routines (<4ULP)."

Yip, this aligns with SVML and libmvec as the current expectation for vector math routines. Is there a way to ask the function to dispatch a non-vectorised variant which does the scalar loop instead? That would at least allow the user to trade speed/accuracy themselves?

@seberg
Member

seberg commented May 2, 2023

How do they compare in practice in terms of accuracy here?

I am somewhat curious whether any of the libs consider that results should (also) be correct in terms of inputs rather than output accuracy, i.e. bounded by correct_func(4_ulp_before(val)) <= func(val) <= correct_func(4_ulp_after(val)) rather than 4_ulp_before(correct_func(val)) <= func(val) <= 4_ulp_after(correct_func(val)), which would mean -1 is enforced.
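
The two bounds can be compared with a small sketch. It assumes Python 3.9+ (`math.nextafter`), uses `math.cos` as the stand-in "correct" function, and generalises the input-based bound to the min/max over the neighbourhood so that it also works at an extremum such as π. The names `step_ulps`, `within_output_bound` and `within_input_bound` are hypothetical.

```python
import math

def step_ulps(x, n):
    """Move x by n representable doubles (n may be negative or zero)."""
    direction = math.inf if n > 0 else -math.inf
    for _ in range(abs(n)):
        x = math.nextafter(x, direction)
    return x

def within_output_bound(ref, approx, x, ulps=4):
    """Forward-error style: approx lies within `ulps` doubles of ref(x)."""
    exact = ref(x)
    return step_ulps(exact, -ulps) <= approx <= step_ulps(exact, ulps)

def within_input_bound(ref, approx, x, ulps=4):
    """Input-perturbation style: approx is a value ref takes within `ulps` doubles of x."""
    candidates = [ref(step_ulps(x, k)) for k in range(-ulps, ulps + 1)]
    return min(candidates) <= approx <= max(candidates)

observed = -0.9999999999999998  # value reported in this thread for np.cos(np.pi)
passes_output = within_output_bound(math.cos, observed, math.pi)
passes_input = within_input_bound(math.cos, observed, math.pi)
```

Because cosine is flat around π, every input within 4 ULPs of `math.pi` maps to -1, so the input-based bound only admits -1 there while the output-based bound still accepts the observed value.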

@PierreBlanchard

PierreBlanchard commented May 2, 2023

I am somewhat curious whether any of the libs consider that results should (also) be correct in terms of inputs rather than output accuracy?

It was not formalised like this in the optimized-routines implementations, as only forward error was considered (not backward errors). But reported errors include the rounding error for the input (+0.5ulp in round to nearest), so do numpy tests I believe.

@PierreBlanchard

It would be nice if the library could ensure that np.cos(np.pi) returned -1, but I don't know if implementing such a guarantee would hurt performance significantly. If this is the state of the art for vectorized trig functions, then we'll have to adapt our code (in my case, SciPy code) to allow for these errors.

It is great to get so much feedback so quickly on this type of design issues. If numpy expects this type of behaviour then there is certainly a way to fix it, but I thought I should add a bit of context.

I very much agree that it would be nice to have (as it seems to matter for many users). Unfortunately, as you guessed it, enforcing that comes at a large cost in performance in most cases. I have not looked into it but there might be a way to minimise this cost on this specific case. However, in general it is not something that should be expected from efficient vector implementations of math routines.
Most elementary math libraries follow the C99 standard, which does not specify this kind of behaviour, and as mentioned a few times, an accuracy threshold > 1ulp allows for such error to occur.

It seems that this type of considerations is usually left to the user, as each user might have different expectations on the number of cases that have simple/known outputs. Users might also have cheaper options to simplify their code to take that into account. If not then falling back to a scalar implementation is probably the way to go.

@seberg
Member

seberg commented May 2, 2023

It seems that this type of considerations is usually left to the user,

The question is whether the current thing is a safe default for the vast majority of users. Unfortunately, we have to choose for the user.

We don't have good user control over it in NumPy. (Yes, you can at compile time, and you can also disable most vectorizations using an env variable.)
We could also expand that to add something like with np.accuracy("low"): but that machinery doesn't exist right now.

So if this is a bad default, that is a problem. We could expose it maybe, but without new API it would be pretty useless in practice. Having better API might also allow us to be a bit more relaxed about a more inaccurate default.

I don't want to argue about the 4 ULP choice. But maybe a question is if/what other practical constraints there are? (E.g. I expect you ensure that the codomain is correct as in -1 <= sin(x) <= 1 for all x; or that the function isn't non-monotonic around branch points.)
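
Constraints of this kind could be spot-checked along the following lines. This is a sketch with hypothetical helper names (`codomain_violations`, `monotonic_violations`), assuming Python 3.9+ for `math.nextafter`; it samples inputs rather than proving anything.

```python
import math

def codomain_violations(func, samples, lo=-1.0, hi=1.0):
    """Return the sampled inputs whose result falls outside [lo, hi]."""
    return [x for x in samples if not (lo <= func(x) <= hi)]

def monotonic_violations(func, start, count):
    """Walk `count` consecutive doubles upward from `start` and return the
    inputs at which a supposedly non-decreasing `func` decreases."""
    bad = []
    x, prev = start, func(start)
    for _ in range(count):
        x = math.nextafter(x, math.inf)
        cur = func(x)
        if cur < prev:
            bad.append(x)
        prev = cur
    return bad

samples = [i * 0.1 for i in range(-100, 101)]
out_of_range = codomain_violations(math.sin, samples)
```

Any callable can be passed as `func`, so the same checks could be pointed at a vectorised kernel evaluated one element at a time.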

@seberg
Member

seberg commented May 2, 2023

@pllim also, how weird are the test adjustments in astropy? We had similar fallout with SVML, but I am not sure it was quite as surprising.

Below are the plots I get without SVML, with SVML, and with the new code for sin, plus a zoom in around np.pi/2 +/- 1e-8. Although, I suspect the main point is really whether the special value error is a problem or not.

EDIT: So basically, SVML is just as bad and also jitters around pi/2. On average maybe a tiny bit less, and for pi/2 itself it happens to be better, by chance or not...

plots

These plots (and probably more so, the rather arbitrary choice of a 1000 point linspace) are not ideal, but maybe it helps give an idea...

[Figures: sin, sin-zoom]

@ksunden
Contributor

ksunden commented May 2, 2023

Writeup of how the change cascades in mpl's test provided here: matplotlib/matplotlib#25789 (comment)

The tl;dr is that np.sin(pi/2) is changed from precisely 1.0 and that cascades through several computations including np.cos(pi/2) and multiplications that prior to this change all precisely canceled to get the exact same input to a forward/reverse transform change.

@pllim
Contributor

pllim commented May 3, 2023

Re: #23399 (comment)

The toughest one we haven't figured out has something to do with Gaussian2D theta calculations roundtripping that caused bounding box size to be off by one, causing a test to fail. We haven't had a chance to decide whether that is a bug or just a poorly defined test. (This one magically disappeared after some follow-up numpy changes.)

@ksunden
Contributor

ksunden commented May 4, 2023

matplotlib/matplotlib#25813

These are our test updates, made to be more tolerant and to avoid inherently unstable/sensitive floating point operations.
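
As a generic illustration of what "more tolerant" can mean (this is not the matplotlib change itself), a test can compare against the expected value with a small ULP budget instead of `==`. The helper `close_within_ulps` is a hypothetical name, and `math.ulp` needs Python 3.9+.

```python
import math

def close_within_ulps(actual, expected, ulps=4):
    """True if actual is within `ulps` units in the last place of expected.

    math.ulp(expected) is the spacing just above |expected|, so at a power
    of two such as 1.0 this is slightly more lenient on the lower side.
    """
    return abs(actual - expected) <= ulps * math.ulp(expected)

cos_pi = -0.9999999999999998    # value reported in this thread for np.cos(np.pi)

exact_match = cos_pi == -1.0
tolerant_match = close_within_ulps(cos_pi, -1.0)
```

The exact comparison fails for the reported value while the tolerant one passes, and a genuinely wrong result (say, off by 1e-12) is still rejected.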

@seberg
Member

seberg commented May 4, 2023

Right now, I am operating under the assumption that these are really all somewhat annoying but OK test upgrades and while it might be nice to tweak things to be slightly more precise, it is hopefully OK as is.

But, if there are things that break in worse ways than just tests (or seem likely to do so in user code), I still think we should consider dealing or checking whether a small precision bump (via better polynomial factors) helps.

@mhvk
Contributor

mhvk commented May 6, 2023

For astropy at least, some of the tests I wrote for spherical to cartesian break just because sin(pi/2) is no longer equal to unity. Obviously, all adjustable, but it does seem that a polynomial conditioned on actually spanning the full -1 to 1 inclusive range would be nicer...
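
As a sketch of what such an adjustment could look like (this is not astropy's implementation; `spherical_to_cartesian` here is a hypothetical stand-in using latitude/longitude in radians), the pole can be checked with a tolerance of a few machine epsilons instead of exact equality:

```python
import numpy as np


def spherical_to_cartesian(r, lat, lon):
    """Hypothetical stand-in: convert (r, lat, lon) in radians to (x, y, z)."""
    x = r * np.cos(lat) * np.cos(lon)
    y = r * np.cos(lat) * np.sin(lon)
    z = r * np.sin(lat)
    return x, y, z


eps = np.finfo(np.float64).eps
x, y, z = spherical_to_cartesian(1.0, np.pi / 2, 0.0)

# Fragile: `assert z == 1.0` relies on sin(pi/2) being exactly 1.0.
# More robust: allow a few eps of error at the pole.
np.testing.assert_allclose(z, 1.0, rtol=4 * eps)
np.testing.assert_allclose([x, y], 0.0, atol=4 * eps)
```

The factor of 4 is an arbitrary choice for this sketch, picked to match the "<4ULP" bound quoted in the PR description.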

Mousius added a commit to Mousius/numpy that referenced this pull request May 27, 2023
This builds on top of numpy#23399 and introduces a NumPy intrinsic variant for [float64 tan](https://github.com/ARM-software/optimized-routines/blob/master/pl/math/v_tan_3u5.c), taken from Optimized Routines under MIT license.

CPU features:
```
NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM
```

Relevant benchmarks:
```
     <main>           <optimized-routines-tan-f64>
+     82.1±0.03μs        103±0.2μs     1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'd')
+     81.6±0.03μs        102±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'd')
+     82.4±0.08μs        103±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'd')
+      81.9±0.2μs        102±0.3μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'd')
+     81.5±0.03μs        101±0.4μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'd')
+     81.6±0.04μs       99.9±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'd')
+     82.8±0.05μs        101±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'd')
+     82.4±0.04μs        100±0.3μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'd')
+      82.6±0.3μs       97.2±0.3μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'd')
-       200±0.1μs       63.9±0.1μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'd')
-       199±0.1μs      63.4±0.07μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'd')
-       201±0.6μs      63.4±0.04μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'd')
-       200±0.1μs      63.2±0.03μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'd')
-       200±0.1μs      59.4±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'd')
-       201±0.2μs      59.3±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'd')
-       200±0.2μs      57.7±0.03μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'd')
-       200±0.3μs      57.6±0.02μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'd')
-       200±0.1μs      53.7±0.01μs     0.27  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')
```
Labels
01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery