ENH: float64 tan using Numpy intrinsics #23603

Closed
wants to merge 4 commits into from

Mousius (Member) commented Apr 17, 2023

This builds on top of #23399 and introduces a float64 tan implementation written with NumPy intrinsics, taken from Optimized Routines (MIT license).

CPU features:

NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM

Relevant benchmarks:

       before           after         ratio
     <main>           <optimized-routines-tan-f64>
+     82.1±0.03μs        103±0.2μs     1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'd')
+     81.6±0.03μs        102±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'd')
+     82.4±0.08μs        103±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'd')
+      81.9±0.2μs        102±0.3μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'd')
+     81.5±0.03μs        101±0.4μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'd')
+     81.6±0.04μs       99.9±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'd')
+     82.8±0.05μs        101±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'd')
+     82.4±0.04μs        100±0.3μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'd')
+      82.6±0.3μs       97.2±0.3μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'd')
-       200±0.1μs       63.9±0.1μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'd')
-       199±0.1μs      63.4±0.07μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'd')
-       201±0.6μs      63.4±0.04μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'd')
-       200±0.1μs      63.2±0.03μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'd')
-       200±0.1μs      59.4±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'd')
-       201±0.2μs      59.3±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'd')
-       200±0.2μs      57.7±0.03μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'd')
-       200±0.3μs      57.6±0.02μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'd')
-       200±0.1μs      53.7±0.01μs     0.27  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')
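
For reading the table: asv reports the ratio as after/before, so rows marked `-` (ratio below 1) are faster on the branch and rows marked `+` (ratio above 1) are slower. A minimal Python sketch of that arithmetic, using three rows copied from the table above; the helper name `ratio_and_speedup` and the shortened labels are only illustrative and not part of asv:

```python
# Rows copied from the asv table above: (before_us, after_us, shortened label).
# "before" is <main>, "after" is the branch under test.
rows = [
    (82.6, 97.2, "UnaryFPSpecial tan, 1, 1, 'd'"),
    (200.0, 63.9, "UnaryFP tan, 4, 4, 'd'"),
    (200.0, 53.7, "UnaryFP tan, 1, 1, 'd'"),
]


def ratio_and_speedup(before_us, after_us):
    """Return (after/before, before/after) for one benchmark row."""
    return after_us / before_us, before_us / after_us


results = {}
for before_us, after_us, label in rows:
    ratio, speedup = ratio_and_speedup(before_us, after_us)
    results[label] = (round(ratio, 2), round(speedup, 2))
    verdict = "faster" if ratio < 1 else "slower"
    print(f"{label}: ratio {ratio:.2f}, {verdict} on the branch")
```

asv computes the ratio from unrounded timings, so recomputing it from the printed columns can differ in the last digit (for example, 103/82.1 is shown as 1.26).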

Mousius (Member, Author) commented Apr 17, 2023

This builds on top of #23399, so I expected sin/cos to change as well. The full comparison also appears to show similar, unrelated SIMD issues:

       before           after         ratio
     [22e515ad]       [29d1b23c]
     <main>           <optimized-routines-tan-2>
+     44.0±0.01μs       77.4±0.1μs     1.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 2, 'd')
+     44.2±0.04μs      77.8±0.06μs     1.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 4, 'd')
+     43.3±0.02μs      75.9±0.06μs     1.76  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'd')
+     43.6±0.04μs      76.3±0.08μs     1.75  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 4, 'd')
+     43.2±0.04μs       74.8±0.1μs     1.73  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'd')
+     43.9±0.02μs      75.0±0.04μs     1.71  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 2, 1, 'd')
+     45.4±0.03μs      74.0±0.08μs     1.63  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 4, 'd')
+     45.2±0.03μs      73.6±0.04μs     1.63  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'd')
+     45.2±0.02μs      72.3±0.02μs     1.60  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'd')
+     47.0±0.01μs      72.7±0.03μs     1.55  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 2, 'd')
+     47.2±0.04μs      72.8±0.05μs     1.54  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 4, 'd')
+     46.9±0.02μs      70.6±0.02μs     1.51  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 2, 1, 'd')
+     46.8±0.02μs      63.0±0.06μs     1.35  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'd')
+      47.1±0.1μs       63.3±0.1μs     1.34  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 4, 'd')
+     46.7±0.03μs       61.2±0.1μs     1.31  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'd')
+     82.1±0.03μs        103±0.2μs     1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'd')
+     47.2±0.02μs      59.2±0.03μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'd')
+     81.6±0.03μs        102±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'd')
+     47.3±0.06μs      59.3±0.04μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 4, 'd')
+     82.4±0.08μs        103±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'd')
+      81.9±0.2μs        102±0.3μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'd')
+     81.5±0.03μs        101±0.4μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'd')
+     81.6±0.04μs       99.9±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'd')
+     82.8±0.05μs        101±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'd')
+     82.4±0.04μs        100±0.3μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'd')
+     47.1±0.01μs      57.1±0.05μs     1.21  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'd')
+      26.8±0.1μs      32.1±0.09μs     1.20  bench_ufunc.NDArrayAsType.time_astype(('float64', 'longdouble'))
+     8.69±0.01μs      10.4±0.01μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'f')
+     8.80±0.01μs      10.5±0.04μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 2, 'f')
+     8.71±0.01μs         10.4±0μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 1, 'f')
+     8.70±0.06μs      10.4±0.02μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'f')
+        8.71±0μs         10.4±0μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'f')
+     8.77±0.05μs      10.4±0.01μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 1, 'f')
+     8.79±0.01μs      10.5±0.04μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 4, 'f')
+     8.72±0.01μs      10.4±0.01μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 1, 'f')
+     8.80±0.02μs      10.5±0.01μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 4, 'f')
+     8.72±0.06μs      10.4±0.01μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 1, 'f')
+        8.79±0μs      10.4±0.01μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 2, 'f')
+     8.80±0.03μs         10.5±0μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 1, 'f')
+     8.79±0.01μs      10.4±0.02μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 2, 'f')
+     8.79±0.02μs      10.4±0.02μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 1, 'f')
+     8.80±0.08μs         10.4±0μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 1, 'f')
+     8.80±0.01μs         10.4±0μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 1, 'f')
+     8.79±0.09μs      10.4±0.01μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 2, 'f')
+        8.81±0μs      10.5±0.01μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 4, 'f')
+      8.82±0.1μs      10.5±0.01μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 2, 'f')
+     8.80±0.01μs      10.4±0.01μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 1, 'f')
+     8.81±0.01μs         10.5±0μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 2, 2, 'f')
+     8.80±0.07μs      10.4±0.01μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 2, 'f')
+        8.84±0μs      10.5±0.03μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 2, 'f')
+     8.81±0.01μs      10.5±0.01μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 2, 'f')
+     8.81±0.01μs      10.5±0.02μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 4, 'f')
+      8.80±0.1μs         10.4±0μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 1, 'f')
+     8.81±0.01μs         10.4±0μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 2, 'f')
+     8.80±0.01μs      10.4±0.01μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 2, 1, 'f')
+     8.81±0.06μs      10.4±0.01μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 2, 4, 'f')
+     8.82±0.03μs      10.5±0.02μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 2, 'f')
+     8.83±0.03μs      10.5±0.02μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 2, 'f')
+     8.80±0.09μs         10.4±0μs     1.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 1, 'f')
+     8.82±0.02μs      10.4±0.01μs     1.19  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 4, 'f')
+     8.82±0.01μs      10.5±0.01μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 2, 'f')
+     8.83±0.01μs      10.5±0.01μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 1, 'f')
+     8.82±0.03μs      10.4±0.01μs     1.18  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 4, 4, 'f')
+     8.83±0.03μs         10.5±0μs     1.18  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 2, 'f')
+     8.82±0.01μs      10.4±0.01μs     1.18  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 4, 'f')
+     8.81±0.03μs      10.4±0.01μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 2, 'f')
+     8.83±0.06μs      10.5±0.02μs     1.18  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 2, 'f')
+        8.81±0μs         10.4±0μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 4, 'f')
+     8.85±0.05μs      10.5±0.01μs     1.18  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 4, 'f')
+     8.83±0.01μs         10.4±0μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 2, 4, 'f')
+     8.86±0.06μs      10.5±0.03μs     1.18  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 4, 'f')
+     8.83±0.05μs      10.4±0.02μs     1.18  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 4, 'f')
+      8.84±0.1μs         10.4±0μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 4, 'f')
+     8.84±0.09μs      10.4±0.01μs     1.18  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 1, 'f')
+      82.6±0.3μs       97.2±0.3μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'd')
+     8.88±0.03μs      10.5±0.01μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 4, 'f')
+      8.89±0.1μs      10.5±0.02μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 4, 4, 'f')
+      8.93±0.1μs      10.5±0.05μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 4, 'f')
+     8.91±0.05μs      10.5±0.01μs     1.17  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 4, 'f')
+      8.91±0.2μs      10.5±0.01μs     1.17  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 4, 'f')
+      8.91±0.1μs      10.4±0.02μs     1.17  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 2, 1, 'f')
+      8.86±0.2μs      10.4±0.01μs     1.17  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 1, 'f')
+      8.92±0.1μs         10.4±0μs     1.17  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 1, 'f')
+        8.92±0μs      10.4±0.02μs     1.17  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 2, 'f')
+      8.95±0.2μs         10.5±0μs     1.17  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 4, 'f')
+      8.95±0.2μs      10.5±0.01μs     1.17  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 2, 'f')
+      8.96±0.2μs      10.5±0.01μs     1.17  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 2, 'f')
+      8.96±0.1μs         10.5±0μs     1.17  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 1, 2, 'f')
+      8.97±0.2μs         10.5±0μs     1.17  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'deg2rad'>, 2, 2, 'f')
+      9.01±0.2μs      10.5±0.07μs     1.16  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 4, 'f')
+      8.97±0.2μs      10.4±0.01μs     1.16  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 2, 'f')
+      8.97±0.2μs      10.4±0.02μs     1.16  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 4, 'f')
+     4.13±0.01μs       4.80±0.1μs     1.16  bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'left_shift'>, 1, 1, 1, 'L')
+      9.00±0.2μs      10.4±0.01μs     1.16  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 2, 4, 'f')
+      9.01±0.2μs      10.5±0.03μs     1.16  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 2, 'f')
+      8.94±0.1μs      10.4±0.01μs     1.16  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 1, 'f')
+      9.02±0.2μs         10.4±0μs     1.16  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 2, 'f')
+      9.13±0.4μs      10.5±0.04μs     1.15  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 4, 1, 'f')
+      9.12±0.3μs         10.4±0μs     1.14  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 2, 1, 'f')
+      9.13±0.3μs      10.4±0.02μs     1.14  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 1, 'f')
+      9.13±0.3μs      10.4±0.01μs     1.14  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'radians'>, 1, 4, 'f')
+     5.71±0.01μs      6.34±0.02μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 4, 2, 'd')
+     5.21±0.03μs       5.78±0.2μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 2, 'e')
+     5.27±0.07μs      5.82±0.08μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 4, 'e')
+      8.72±0.5μs      9.60±0.06μs     1.10  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 1, 'd')
+      7.82±0.2μs       8.58±0.2μs     1.10  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 1, 1, 2, 'D')
+      26.5±0.1μs       29.0±0.1μs     1.10  bench_ufunc.NDArrayAsType.time_astype(('complex64', 'longdouble'))
+     5.82±0.02μs       6.34±0.1μs     1.09  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 2, 4, 1, 'F')
+     6.26±0.03μs      6.81±0.01μs     1.09  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 4, 1, 1, 'F')
+     5.17±0.03μs      5.62±0.04μs     1.09  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 2, 2, 'e')
+      6.13±0.2μs       6.65±0.2μs     1.08  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 1, 4, 1, 'F')
+     5.12±0.06μs      5.55±0.01μs     1.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 1, 'e')
+        5.18±0μs      5.61±0.03μs     1.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 2, 'e')
+     5.17±0.03μs      5.60±0.02μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 2, 2, 'e')
+     3.87±0.03μs      4.19±0.02μs     1.08  bench_ufunc.NDArrayAsType.time_astype(('int16', 'float64'))
+     5.11±0.07μs      5.53±0.04μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 2, 1, 'e')
+     5.13±0.02μs      5.55±0.02μs     1.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 2, 1, 'e')
+     7.77±0.03μs       8.40±0.1μs     1.08  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 2, 4, 4, 'F')
+     5.29±0.03μs      5.71±0.03μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 2, 4, 'e')
+     5.18±0.04μs      5.59±0.08μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 2, 'e')
+     6.68±0.09μs       7.21±0.3μs     1.08  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in0(<ufunc 'fmin'>, 4, 4, 1, 'd')
+     5.24±0.04μs       5.65±0.2μs     1.08  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 4, 1, 1, 'D')
+     5.39±0.05μs       5.80±0.1μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 2, 'f')
+     6.60±0.06μs       7.09±0.1μs     1.08  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'negative'>, 2, 4, 'd')
+     5.73±0.03μs       6.16±0.3μs     1.08  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'add'>, 2, 2, 2, 'F')
+      52.7±0.2μs       56.6±0.5μs     1.07  bench_ufunc.NDArrayAsType.time_astype(('complex128', 'clongdouble'))
+     5.13±0.05μs      5.51±0.06μs     1.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 1, 'e')
+      10.2±0.3μs      10.9±0.05μs     1.07  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 4, 4, 1, 'D')
+     7.00±0.04μs      7.49±0.03μs     1.07  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'add'>, 2, 1, 4, 'F')
+     8.41±0.06μs       8.99±0.2μs     1.07  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in1(<ufunc 'maximum'>, 4, 1, 2, 'd')
+      8.46±0.2μs       9.04±0.3μs     1.07  bench_ufunc_strides.BinaryFP.time_binary_scalar_in1(<ufunc 'maximum'>, 4, 2, 2, 'd')
+     5.17±0.02μs      5.53±0.06μs     1.07  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 1, 'e')
+     5.33±0.04μs      5.69±0.05μs     1.07  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 4, 4, 'e')
+      6.22±0.1μs       6.64±0.2μs     1.07  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 4, 1, 1, 'F')
+     6.58±0.02μs       7.02±0.1μs     1.07  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 2, 4, 2, 'F')
+     5.61±0.01μs       5.98±0.2μs     1.07  bench_ufunc_strides.BinaryFP.time_binary_scalar_in0(<ufunc 'minimum'>, 4, 4, 1, 'f')
+     5.22±0.05μs      5.57±0.03μs     1.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 2, 'e')
+     5.36±0.02μs      5.71±0.05μs     1.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 4, 'e')
+     5.36±0.03μs      5.70±0.06μs     1.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 4, 4, 'e')
+     6.99±0.09μs      7.43±0.01μs     1.06  bench_ufunc_strides.BinaryInt.time_binary(<ufunc 'minimum'>, 1, 1, 1, 'h')
+     9.98±0.06μs       10.6±0.2μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'multiply'>, 1, 2, 1, 'D')
+     8.38±0.01μs       8.91±0.4μs     1.06  bench_ufunc_strides.BinaryFP.time_binary_scalar_in1(<ufunc 'maximum'>, 2, 2, 4, 'd')
+     10.3±0.03μs       10.9±0.1μs     1.06  bench_ufunc_strides.BinaryFP.time_binary_scalar_in1(<ufunc 'maximum'>, 4, 1, 4, 'd')
+     8.06±0.02μs       8.56±0.2μs     1.06  bench_ufunc_strides.BinaryFP.time_binary(<ufunc 'minimum'>, 1, 2, 4, 'd')
+      8.42±0.1μs       8.95±0.2μs     1.06  bench_ufunc_strides.BinaryFP.time_binary_scalar_in1(<ufunc 'minimum'>, 2, 4, 4, 'd')
+     7.77±0.03μs       8.25±0.2μs     1.06  bench_ufunc_strides.BinaryFP.time_binary_scalar_in1(<ufunc 'fmax'>, 1, 4, 4, 'd')
+     7.72±0.09μs       8.19±0.3μs     1.06  bench_ufunc_strides.BinaryFP.time_binary_scalar_in1(<ufunc 'maximum'>, 1, 4, 4, 'd')
+     7.77±0.02μs      8.23±0.05μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 1, 4, 2, 'D')
+     5.61±0.02μs       5.94±0.1μs     1.06  bench_ufunc_strides.UnaryIntContig.time_unary(<ufunc 'positive'>, 1, 1, 'H')
+     6.72±0.08μs      7.12±0.06μs     1.06  bench_ufunc_strides.BinaryFP.time_binary_scalar_in0(<ufunc 'maximum'>, 2, 4, 1, 'd')
+      8.40±0.2μs      8.89±0.03μs     1.06  bench_ufunc_strides.BinaryFP.time_binary_scalar_in0(<ufunc 'fmax'>, 4, 4, 2, 'd')
+      7.54±0.1μs       7.97±0.1μs     1.06  bench_ufunc_strides.BinaryFPSpecial.time_binary(<ufunc 'minimum'>, 1, 4, 2, 'd')
+     9.77±0.02μs       10.3±0.1μs     1.06  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 4, 1, 1, 'D')
+     5.77±0.06μs       6.09±0.2μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'square'>, 4, 2, 'd')
+        6.02±0μs       6.34±0.1μs     1.05  bench_ufunc_strides.BinaryFPSpecial.time_binary(<ufunc 'minimum'>, 2, 4, 4, 'f')
+     36.8±0.04μs       38.8±0.7μs     1.05  bench_ufunc_strides.BinaryFP.time_binary(<ufunc 'ldexp'>, 4, 1, 2, 'd')
+     84.8±0.06μs       89.3±0.1μs     1.05  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'e')
+     5.23±0.07μs      5.51±0.02μs     1.05  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 1, 'e')
+      9.28±0.4μs      9.77±0.02μs     1.05  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 1, 'f')
+     5.92±0.01μs       6.23±0.2μs     1.05  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in1(<ufunc 'add'>, 4, 1, 1, 'F')
+     5.82±0.01μs      6.12±0.05μs     1.05  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 4, 2, 'd')
+     84.9±0.01μs      89.3±0.04μs     1.05  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 4, 'e')
+      10.5±0.2μs      11.1±0.08μs     1.05  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in1(<ufunc 'fmax'>, 4, 2, 4, 'd')
+      8.33±0.1μs       8.75±0.1μs     1.05  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 2, 4, 2, 'F')
+     84.8±0.05μs      89.1±0.01μs     1.05  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'e')
-     6.72±0.04μs      6.40±0.01μs     0.95  bench_ufunc_strides.BinaryIntContig.time_binary_scalar_in0(<ufunc 'right_shift'>, 1, 1, 1, 'l')
-      42.0±0.1μs      40.0±0.09μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log1p'>, 1, 4, 'd')
-      11.0±0.2μs      10.5±0.02μs     0.95  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'multiply'>, 4, 2, 2, 'F')
-      8.26±0.2μs       7.86±0.1μs     0.95  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 2, 1, 2, 'D')
-      9.02±0.1μs       8.58±0.1μs     0.95  bench_ufunc_strides.BinaryFP.time_binary(<ufunc 'maximum'>, 4, 2, 2, 'd')
-     7.22±0.09μs       6.86±0.2μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rint'>, 2, 4, 'd')
-      7.00±0.1μs       6.66±0.1μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'negative'>, 2, 4, 'd')
-         837±2ns          795±2ns     0.95  bench_ufunc.MethodsV1.time_ndarray_meth('__ge__', 'float64')
-      10.2±0.3μs      9.67±0.05μs     0.95  bench_ufunc_strides.BinaryFP.time_binary(<ufunc 'minimum'>, 4, 1, 4, 'd')
-        1.43±0μs      1.36±0.01μs     0.95  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0, None))
-        1.55±0μs      1.47±0.01μs     0.95  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), axis=0))
-     1.09±0.01μs         1.03±0μs     0.95  bench_ufunc.UFuncSmall.time_ufunc_small_int_array('sqrt')
-     5.84±0.03μs      5.54±0.01μs     0.95  bench_ufunc.NDArrayAsType.time_astype(('int64', 'complex128'))
-     1.12±0.01μs         1.06±0μs     0.95  bench_ufunc.UFuncSmall.time_ufunc_small_int_array('cos')
-     7.18±0.04μs      6.81±0.02μs     0.95  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 2, 2, 2, 'F')
-      7.04±0.2μs      6.68±0.01μs     0.95  bench_ufunc_strides.BinaryFP.time_binary_scalar_in0(<ufunc 'maximum'>, 1, 4, 1, 'd')
-      6.15±0.1μs      5.83±0.01μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'absolute'>, 1, 4, 'd')
-      6.01±0.1μs      5.70±0.07μs     0.95  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'subtract'>, 1, 1, 1, 'D')
-        1.40±0μs      1.32±0.01μs     0.95  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.])))
-     9.00±0.09μs       8.51±0.2μs     0.95  bench_ufunc_strides.BinaryFPSpecial.time_binary(<ufunc 'fmax'>, 2, 2, 4, 'd')
-      7.06±0.1μs      6.67±0.02μs     0.95  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'floor'>, 2, 4, 'd')
-        1.44±0μs      1.36±0.01μs     0.95  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0))
-      8.44±0.3μs      7.96±0.04μs     0.94  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 1, 2, 4, 'F')
-     7.87±0.03μs      7.42±0.07μs     0.94  bench_ufunc_strides.BinaryFPSpecial.time_binary(<ufunc 'fmin'>, 1, 1, 4, 'd')
-      7.18±0.1μs      6.77±0.08μs     0.94  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in1(<ufunc 'maximum'>, 4, 1, 1, 'd')
-      8.86±0.2μs      8.34±0.04μs     0.94  bench_ufunc_strides.BinaryFPSpecial.time_binary_scalar_in1(<ufunc 'fmax'>, 4, 4, 2, 'd')
-      26.9±0.7μs      25.2±0.03μs     0.94  bench_ufunc.NDArrayAsType.time_astype(('float32', 'longdouble'))
-     9.13±0.06μs      8.57±0.01μs     0.94  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 4, 1, 'd')
-      5.57±0.3μs      5.21±0.07μs     0.94  bench_ufunc.NDArrayAsType.time_astype(('complex64', 'complex128'))
-     8.47±0.03μs      7.88±0.07μs     0.93  bench_ufunc_strides.BinaryFPSpecial.time_binary(<ufunc 'fmin'>, 2, 1, 4, 'd')
-         178±2μs          166±2μs     0.93  bench_ufunc.UFunc.time_ufunc_types('negative')
-     6.71±0.03μs       6.24±0.1μs     0.93  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 1, 4, 1, 'F')
-      7.43±0.1μs      6.90±0.02μs     0.93  bench_ufunc_strides.BinaryFP.time_binary(<ufunc 'minimum'>, 4, 4, 4, 'f')
-     11.0±0.08μs       10.2±0.1μs     0.93  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 4, 'd')
-      8.46±0.2μs      7.84±0.06μs     0.93  bench_ufunc_strides.BinaryComplex.time_binary(<ufunc 'add'>, 1, 2, 1, 'D')
-      7.00±0.2μs       6.39±0.2μs     0.91  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'subtract'>, 4, 2, 1, 'D')
-     10.2±0.02μs       9.25±0.6μs     0.91  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 4, 'd')
-      19.7±0.4μs      17.7±0.02μs     0.90  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 4, 'e')
-     19.2±0.01μs       17.3±0.4μs     0.90  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 1, 'e')
-      11.2±0.2μs       10.1±0.1μs     0.90  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 4, 'd')
-      20.7±0.4μs       18.4±0.3μs     0.89  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 2, 'e')
-      6.69±0.3μs      5.94±0.01μs     0.89  bench_ufunc_strides.BinaryComplex.time_binary_scalar_in0(<ufunc 'add'>, 2, 4, 1, 'F')
-     11.3±0.09μs      9.97±0.07μs     0.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 4, 'd')
-     4.69±0.01μs      4.13±0.01μs     0.88  bench_ufunc_strides.BinaryIntContig.time_binary(<ufunc 'left_shift'>, 1, 1, 1, 'q')
-      21.8±0.4μs         19.1±1μs     0.88  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 1, 'e')
-      11.3±0.1μs       9.85±0.1μs     0.87  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 4, 'd')
-      20.8±0.4μs      18.1±0.08μs     0.87  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 4, 4, 'e')
-      19.5±0.4μs       16.9±0.8μs     0.86  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 1, 'e')
-      19.7±0.4μs      17.0±0.01μs     0.86  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 2, 'e')
-     10.8±0.01μs       9.27±0.6μs     0.86  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 2, 'd')
-      19.7±0.4μs         16.6±1μs     0.84  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 4, 'e')
-     10.2±0.05μs       8.56±0.2μs     0.84  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 2, 4, 'd')
-        10.5±0μs      8.74±0.09μs     0.83  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 4, 'd')
-        10.5±0μs       8.70±0.1μs     0.83  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 2, 'd')
-     10.7±0.01μs      8.94±0.06μs     0.83  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 4, 'd')
-     10.7±0.06μs      8.85±0.04μs     0.83  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 4, 'd')
-     10.5±0.04μs       8.70±0.1μs     0.83  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 2, 'd')
-     10.5±0.01μs       8.73±0.1μs     0.83  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 1, 'd')
-     10.6±0.09μs      8.74±0.09μs     0.83  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 4, 'd')
-     10.5±0.01μs      8.63±0.03μs     0.83  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 2, 'd')
-     10.5±0.02μs      8.64±0.03μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 4, 'd')
-     10.7±0.06μs       8.73±0.1μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 4, 'd')
-        10.4±0μs         8.55±0μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 2, 'd')
-     10.5±0.07μs      8.60±0.01μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 2, 'd')
-        10.4±0μs      8.55±0.02μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'd')
-        10.4±0μs      8.54±0.01μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 2, 'd')
-     10.4±0.01μs         8.55±0μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 2, 'd')
-        10.4±0μs         8.54±0μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 1, 1, 'd')
-     10.7±0.02μs      8.77±0.06μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 4, 'd')
-     10.4±0.01μs      8.52±0.01μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 1, 1, 'd')
-        10.4±0μs      8.54±0.01μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 2, 1, 'd')
-     10.4±0.01μs         8.54±0μs     0.82  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 2, 1, 'd')
-     10.4±0.01μs      8.52±0.02μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 1, 'd')
-     10.5±0.02μs      8.53±0.01μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 2, 'd')
-     10.4±0.01μs      8.52±0.01μs     0.82  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 1, 'd')
-     10.5±0.08μs      8.57±0.02μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 1, 'd')
-     10.7±0.03μs      8.76±0.04μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 2, 4, 'd')
-     10.5±0.01μs      8.54±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 2, 1, 'd')
-      11.1±0.1μs      8.97±0.05μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 2, 'd')
-     10.7±0.04μs      8.63±0.02μs     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'rad2deg'>, 4, 1, 'd')
-     11.1±0.01μs       8.92±0.3μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 4, 2, 'd')
-     10.6±0.05μs      8.55±0.01μs     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 4, 1, 'd')
-      11.2±0.1μs       8.88±0.2μs     0.80  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'degrees'>, 4, 2, 'd')
-     50.3±0.08μs      40.0±0.01μs     0.79  bench_ufunc.NDArrayAsType.time_astype(('float16', 'longdouble'))
-     19.3±0.01μs       15.2±0.2μs     0.79  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 2, 'e')
-         199±1μs        152±0.7μs     0.77  bench_ufunc.UFunc.time_ufunc_types('fabs')
-     72.1±0.07μs      50.5±0.06μs     0.70  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 4, 'd')
-     71.9±0.02μs      50.3±0.02μs     0.70  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'd')
-     71.9±0.09μs      50.2±0.01μs     0.70  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 2, 'd')
-      72.6±0.2μs      50.7±0.07μs     0.70  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 4, 'd')
-     71.8±0.04μs      48.1±0.01μs     0.67  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'd')
-     72.1±0.03μs      48.2±0.06μs     0.67  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 4, 'd')
-     71.8±0.08μs      47.2±0.01μs     0.66  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'd')
-     71.7±0.01μs      47.2±0.01μs     0.66  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 2, 1, 'd')
-     71.7±0.03μs      44.7±0.01μs     0.62  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'd')
-     85.3±0.06μs      44.3±0.07μs     0.52  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'd')
-     85.7±0.03μs       44.4±0.1μs     0.52  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 4, 'd')
-      85.0±0.1μs      44.0±0.01μs     0.52  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 2, 'd')
-     85.3±0.09μs      44.1±0.03μs     0.52  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 4, 'd')
-     85.5±0.07μs      41.7±0.01μs     0.49  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'd')
-      85.8±0.2μs      41.8±0.02μs     0.49  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 4, 'd')
-      85.2±0.1μs       40.3±0.1μs     0.47  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'd')
-     84.9±0.05μs      39.7±0.01μs     0.47  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 2, 1, 'd')
-     85.4±0.05μs      36.7±0.07μs     0.43  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'd')
-       200±0.1μs       63.9±0.1μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'd')
-       199±0.1μs      63.4±0.07μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'd')
-       201±0.6μs      63.4±0.04μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'd')
-       200±0.1μs      63.2±0.03μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'd')
-       200±0.1μs      59.4±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'd')
-       201±0.2μs      59.3±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'd')
-       200±0.2μs      57.7±0.03μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'd')
-       200±0.3μs      57.6±0.02μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'd')
-       200±0.1μs      53.7±0.01μs     0.27  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')

@Mousius Mousius force-pushed the optimized-routines-tan-f64 branch 4 times, most recently from a838920 to 681ef62 Compare April 17, 2023 21:23
@Mousius Mousius force-pushed the optimized-routines-tan-f64 branch from 681ef62 to 9429113 Compare April 25, 2023 20:03
@Mousius
Member Author

Mousius commented Apr 28, 2023

@seiko2plus can you take a look at this one please? I've tried to put the intrinsic code centrally but I see a lot of people have implemented them as part of the various operators, unsure what the strategy is 😸

Member

@seiko2plus seiko2plus left a comment

Just a few changes. The implementation looks good to me; the only issue is that we are still going to keep the SVML version when it's available (AVX512_SKX & Linux) alongside your implementation, due to accuracy differences.

@@ -8,6 +8,10 @@
#define NPY_SIMD_FMA3 1 // native support
#define NPY_SIMD_BIGENDIAN 0
#define NPY_SIMD_CMPSIGNAL 0
#ifdef NPY_HAVE_AVX512DQ
#define NPY_SIMD_CVT_F64 1
Member

Suggested change
- #define NPY_SIMD_CVT_F64 1
+ #define NPY_SIMD_CVT_F64 1
+ #else
+ #define NPY_SIMD_CVT_F64 0

Comment on lines 119 to 120
npyv_f64 q = npyv_sub_f64(npyv_muladd_f64(x, two_over_pi, shift), shift);
npyv_s64 qi = npyv_cvt_s64_f64(q);
Member

Suggested change
- npyv_f64 q = npyv_sub_f64(npyv_muladd_f64(x, two_over_pi, shift), shift);
- npyv_s64 qi = npyv_cvt_s64_f64(q);
+ npyv_f64 xp_magic = npyv_muladd_f64(x, two_over_pi, shift);
+ npyv_f64 q = npyv_sub_f64(xp_magic, shift);
+ npyv_s64 qi = npyv_sub_s64(npyv_reinterpret_s64_f64(xp_magic), npyv_reinterpret_s64_f64(shift));

Wouldn't this lead to the same result? It is faster than integer truncation towards zero and, most importantly, supported by all archs.
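For readers unfamiliar with the trick in the suggestion, here is a minimal scalar sketch of it in plain C. It assumes the shift constant is 0x1.8p52 (1.5 * 2^52), which this thread does not confirm, and the helper names `bits_of` and `round_via_shift` are hypothetical. Adding the shift pushes the value into a range where one ULP equals 1, so the addition itself rounds to the nearest integer, and that integer then sits in the low mantissa bits.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as a signed 64-bit integer. */
static int64_t bits_of(double d)
{
    int64_t i;
    memcpy(&i, &d, sizeof i);
    return i;
}

/*
 * Round v to the nearest integer without a float-to-int conversion.
 * Assumes |v| < 2^51 and the default round-to-nearest mode.
 * The shift value is an assumption for illustration.
 */
static int64_t round_via_shift(double v, double *q_out)
{
    const double shift = 0x1.8p52;
    /* volatile keeps the intermediate in double precision */
    volatile double magic = v + shift;
    /* q: the rounded value as a double */
    *q_out = magic - shift;
    /* qi: subtracting the bit patterns leaves the rounded integer */
    return bits_of(magic) - bits_of(shift);
}
```

In this sketch the integer comes from an integer subtraction only, which is why no `npyv_cvt_s64_f64` style conversion would be needed.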

@@ -0,0 +1,24 @@
/*
Member

I don't think it's that common. I suggest moving them close to where they are used, inside the dispatch-able source, or into a separate header inside the umath dir under the prefix simd_ rather than npyv_, in case you plan to share these macros with other sources.

Member Author

Ah, I'll move it to umath, it's not common currently because most of the routines are written in asm for SVML. As we port more of the routines to universal intrinsics you'll see it used often: https://github.com/search?q=repo%3AARM-software%2Foptimized-routines%20estrin&type=code
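As background on the Estrin scheme mentioned here, the following is a minimal scalar sketch for a degree-3 polynomial, compared with Horner's rule. The function names `horner3` and `estrin3` are hypothetical and this is not the macro set from the PR; it only illustrates the evaluation order.

```c
/* Horner's rule: one long dependency chain of multiply-adds. */
static double horner3(double x, const double c[4])
{
    return c[0] + x * (c[1] + x * (c[2] + x * c[3]));
}

/*
 * Estrin's scheme: split the polynomial into independent pairs,
 * p(x) = (c0 + c1*x) + x^2 * (c2 + c3*x),
 * so the two inner terms can be computed in parallel.
 */
static double estrin3(double x, const double c[4])
{
    double x2 = x * x;
    double lo = c[0] + c[1] * x;
    double hi = c[2] + c[3] * x;
    return lo + x2 * hi;
}
```

Both functions compute the same polynomial; Estrin trades a few extra multiplications for a shorter dependency chain, which tends to suit SIMD pipelines.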

0x1.7ea75d05b583ep-10, 0x1.289f22964a03cp-11, 0x1.4e4fd14147622p-12
};

npyv_u64 iax = npyv_and_u64(npyv_reinterpret_u64_f64(x), abs_mask);
Member

Suggested change
- npyv_u64 iax = npyv_and_u64(npyv_reinterpret_u64_f64(x), abs_mask);
+ npyv_u64 iax = npyv_reinterpret_u64_f64(npyv_abs_f64 (x));

maps to vabsq_f64 on Neon
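The two forms in this suggestion should produce the same bits. A minimal scalar sketch in plain C, assuming `abs_mask` clears only the sign bit (0x7fffffffffffffff), which the snippet does not show; the helper names are hypothetical:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as an unsigned 64-bit integer. */
static uint64_t bits_u64(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Original form: mask off the sign bit of the raw bit pattern. */
static uint64_t abs_bits_mask(double x)
{
    return bits_u64(x) & UINT64_C(0x7fffffffffffffff);
}

/* Suggested form: take the absolute value, then reinterpret. */
static uint64_t abs_bits_fabs(double x)
{
    return bits_u64(fabs(x));
}
```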

Member Author

Thanks, I think I completely missed the obvious here 😸

@seiko2plus
Member

I've tried to put the intrinsic code centrally but I see a lot of people have implemented them as part of the various operators, unsure what the strategy is

I think it's fine except for npyv_cvt_s64_f64, it should be moved close to

#define npyv_trunc_f64 vrndq_f64
and renamed to npyv_trunc_s64_f64

Member

@r-devulap r-devulap left a comment

LGTM. Given perf and accuracy numbers (below), I am not sure if we should retain the SVML implementation for AVX-512. It is a tiny bit faster but having the same implementation across all SIMD archs has its benefits too.

Accuracy:

SVML max ULP error is 3.93 and this one seems to be 3.48. The distributions of ULP errors for values (-1, 1) look to be the same as SVML:

| ufunc | 0 ulp | 1 ulp | 2 ulp | 3 ulp |
|---|---|---|---|---|
| NumPy intrinsic tan | 61.29 % | 37.86 % | 0.85 % | 0.0 % |
| SVML tan | 59.80 % | 39.15 % | 1.02 % | 0.02 % |
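For context on the ULP buckets above, here is a minimal sketch of one way to count the whole-number ULP distance between a computed double and a reference double. How the numbers in this comment were actually measured is not stated, so this is an assumption for illustration only; `ordered_bits` and `ulp_distance` are hypothetical names, and NaN inputs are not handled.

```c
#include <stdint.h>
#include <string.h>

/*
 * Map a double's bit pattern to an integer that increases
 * monotonically with the value, so -0.0 and +0.0 both map to 0.
 */
static int64_t ordered_bits(double d)
{
    int64_t i;
    memcpy(&i, &d, sizeof i);
    return i < 0 ? INT64_MIN - i : i;
}

/* Number of representable doubles between a and b. */
static uint64_t ulp_distance(double a, double b)
{
    int64_t ia = ordered_bits(a);
    int64_t ib = ordered_bits(b);
    return ia > ib ? (uint64_t)ia - (uint64_t)ib
                   : (uint64_t)ib - (uint64_t)ia;
}
```

A fractional maximum such as 3.48 would need a higher-precision reference value; the integer distance here only corresponds to the bucketed columns.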

Perf (on AVX-512):

SVML implementation is a little bit faster:

       before           after         ratio
     [921a97e6]       [926ed9cd]
     <main>           <optimized-routines-tan-f64>
+         184±2μs        245±0.2μs     1.33  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'd')
+       146±0.1μs        192±0.3μs     1.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'd')
+       205±0.7μs          259±1μs     1.26  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'd')
+         209±1μs          240±3μs     1.15  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'd')
+         261±3μs          292±1μs     1.12  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'd')
+       182±0.4μs        202±0.8μs     1.11  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'd')
+         185±1μs          202±1μs     1.09  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'd')
+         144±4μs        153±0.1μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')

}
#endif
#if @fxnan@ == @len@
// Workaround, invalid value encountered when x is set to nan
Member

Are you seeing an invalid raised for np.expm1(np.array([np.nan])) ? Wonder why I am not seeing that locally on my SkylakeX.

Member

Oh, wait never mind. Looks like that code was already there :) If there is no way around the function duplication, then I suggest you get rid of the #if @fxnan@ == @len@ section for simd_tan_f32 below, it seems unnecessary.

* #func = exp2, log2, log10, expm1, log1p, cbrt, tan, asin, acos, atan, sinh, cosh, asinh, acosh, atanh#
* #default_val = 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0#
* #fxnan = 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0#
* #func = exp2, log2, log10, expm1, log1p, cbrt, asin, acos, atan, sinh, cosh, asinh, acosh, atanh#
Member

I wonder if there is a way to avoid having two of these simd_@func@_@sfx@ functions. It looks like an exact copy of the code.

* #func = tan#
* #intrin = tan#
*/
NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(@TYPE@_@func@)
Member

Same here, this looks like a duplicate of the function below.

Member Author

Argh, sorry, should've put a bit more detail in the PR - we're going to go through all of the routines in https://github.com/ARM-software/optimized-routines, so this is a transient state until I raise the tan f32 PR which moves that out of this file also. I didn't want to do too much gluing together with a condition purely for tan which I would then remove.

Member

Makes sense. Maybe add a comment? And also get rid of the @fxnan@ section above.

Member Author

Ah yes, a comment would be a good idea 😸 and potentially removing that bit will make the diff more sensible 🤔

@Mousius Mousius force-pushed the optimized-routines-tan-f64 branch 2 times, most recently from e01dd8c to f1cb80e Compare May 26, 2023 21:24
Mousius added 4 commits May 27, 2023 13:02
This builds on top of numpy#23399 and introduces a NumPy intrinsic variant for [float64 tan](https://github.com/ARM-software/optimized-routines/blob/master/pl/math/v_tan_3u5.c), taken from Optimized Routines under MIT license.

CPU features:
```
NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM
```

Relevant benchmarks:
```
     <main>           <optimized-routines-tan-f64>
+     82.1±0.03μs        103±0.2μs     1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'd')
+     81.6±0.03μs        102±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'd')
+     82.4±0.08μs        103±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'd')
+      81.9±0.2μs        102±0.3μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'd')
+     81.5±0.03μs        101±0.4μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'd')
+     81.6±0.04μs       99.9±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'd')
+     82.8±0.05μs        101±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'd')
+     82.4±0.04μs        100±0.3μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'd')
+      82.6±0.3μs       97.2±0.3μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'd')
-       200±0.1μs       63.9±0.1μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'd')
-       199±0.1μs      63.4±0.07μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'd')
-       201±0.6μs      63.4±0.04μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'd')
-       200±0.1μs      63.2±0.03μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'd')
-       200±0.1μs      59.4±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'd')
-       201±0.2μs      59.3±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'd')
-       200±0.2μs      57.7±0.03μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'd')
-       200±0.3μs      57.6±0.02μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'd')
-       200±0.1μs      53.7±0.01μs     0.27  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')
```
Use neg instruction if available, that seems safer, else fallback to
previously created intrinsics which are now available centrally.
* Removes npyv_cvt functions in favour of subtraction method
* Moves Estrin implementation to umath
* Minor text fixes to simd.h for negative.h
@Mousius Mousius force-pushed the optimized-routines-tan-f64 branch from f1cb80e to 3c3abfd Compare May 27, 2023 12:02
@Mousius
Member Author

Mousius commented May 27, 2023

Thanks for the feedback @r-devulap and @seiko2plus, I've updated everything to reflect the conversation, please take another look when you get a chance 😸

Mousius added a commit to Mousius/numpy that referenced this pull request May 30, 2023
This builds on top of numpy#23603, and introduces a NumPy intrinsic variant for [float32 tan](https://github.com/ARM-software/optimized-routines/blob/master/pl/math/v_tanf_3u5.c), taken from Optimized Routines under MIT license.

CPU features:
```
NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM
```

Relevant benchmarks:
```
+     45.3±0.07μs       73.0±0.1μs     1.61  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'f')
+     45.5±0.07μs       73.1±0.1μs     1.61  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'f')
+     45.1±0.05μs      68.2±0.02μs     1.51  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'f')
+     45.3±0.04μs      68.3±0.02μs     1.51  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'f')
+     46.4±0.09μs      65.3±0.04μs     1.41  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'f')
+     46.5±0.05μs      65.3±0.02μs     1.40  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'f')
+      45.4±0.1μs       63.3±0.2μs     1.40  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'f')
+     45.2±0.09μs       60.3±0.2μs     1.33  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'f')
+     46.3±0.02μs       57.1±0.2μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'f')
-      127±0.08μs      38.2±0.02μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'f')
-      127±0.05μs      38.3±0.01μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'f')
-      128±0.05μs      35.6±0.01μs     0.28  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'f')
-       129±0.3μs      35.6±0.01μs     0.28  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'f')
-      128±0.04μs      32.5±0.01μs     0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'f')
-       129±0.4μs         32.5±0μs     0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'f')
-      127±0.06μs      31.4±0.02μs     0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'f')
-      128±0.09μs      27.8±0.02μs     0.22  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'f')
-      128±0.05μs      24.8±0.01μs     0.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'f')
```
@Mousius Mousius closed this Nov 29, 2023

4 participants