ENH: float32 tan using NumPy intrinsics #23844


Closed
Mousius wants to merge 5 commits from the optimized-routines-tan-f32 branch


@Mousius Mousius commented May 30, 2023

This builds on top of #23603, and introduces a NumPy intrinsic variant for float32 tan, taken from Optimized Routines under MIT license.

CPU features:

NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM

Relevant benchmarks:

+     45.3±0.07μs       73.0±0.1μs     1.61  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'f')
+     45.5±0.07μs       73.1±0.1μs     1.61  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'f')
+     45.1±0.05μs      68.2±0.02μs     1.51  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'f')
+     45.3±0.04μs      68.3±0.02μs     1.51  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'f')
+     46.4±0.09μs      65.3±0.04μs     1.41  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'f')
+     46.5±0.05μs      65.3±0.02μs     1.40  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'f')
+      45.4±0.1μs       63.3±0.2μs     1.40  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'f')
+     45.2±0.09μs       60.3±0.2μs     1.33  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'f')
+     46.3±0.02μs       57.1±0.2μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'f')
-      127±0.08μs      38.2±0.02μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'f')
-      127±0.05μs      38.3±0.01μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'f')
-      128±0.05μs      35.6±0.01μs     0.28  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'f')
-       129±0.3μs      35.6±0.01μs     0.28  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'f')
-      128±0.04μs      32.5±0.01μs     0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'f')
-       129±0.4μs         32.5±0μs     0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'f')
-      127±0.06μs      31.4±0.02μs     0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'f')
-      128±0.09μs      27.8±0.02μs     0.22  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'f')
-      128±0.05μs      24.8±0.01μs     0.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'f')

Mousius added 4 commits May 27, 2023 13:02
This builds on top of numpy#23399 and introduces a NumPy intrinsic variant for [float64 tan](https://github.com/ARM-software/optimized-routines/blob/master/pl/math/v_tan_3u5.c), taken from Optimized Routines under MIT license.

CPU features:
```
NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM
```

Relevant benchmarks:
```
     <main>           <optimized-routines-tan-f64>
+     82.1±0.03μs        103±0.2μs     1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'd')
+     81.6±0.03μs        102±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'd')
+     82.4±0.08μs        103±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'd')
+      81.9±0.2μs        102±0.3μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'd')
+     81.5±0.03μs        101±0.4μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'd')
+     81.6±0.04μs       99.9±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'd')
+     82.8±0.05μs        101±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'd')
+     82.4±0.04μs        100±0.3μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'd')
+      82.6±0.3μs       97.2±0.3μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'd')
-       200±0.1μs       63.9±0.1μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'd')
-       199±0.1μs      63.4±0.07μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'd')
-       201±0.6μs      63.4±0.04μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'd')
-       200±0.1μs      63.2±0.03μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'd')
-       200±0.1μs      59.4±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'd')
-       201±0.2μs      59.3±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'd')
-       200±0.2μs      57.7±0.03μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'd')
-       200±0.3μs      57.6±0.02μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'd')
-       200±0.1μs      53.7±0.01μs     0.27  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')
```
Use the neg instruction if available, which seems safer; otherwise fall back to
the previously created intrinsics, which are now available centrally.
* Removes npyv_cvt functions in favour of subtraction method
* Moves Estrin implementation to umath
* Minor text fixes to simd.h for negative.h
npyv_f32 n = npyv_sub_f32(q, shift);

/* n is representable as a signed integer, simply convert it. */
npyv_s32 in = npyv_rint_s32_f32(n);

Any ideas on how we could rewrite this one, @seiko2plus?
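
For context on the snippet in question, here is a scalar sketch of the shift-constant rounding trick that the subtraction appears to rely on. It assumes `shift` is 1.5 * 2^23 (a typical choice in float32 routines of this kind), round-to-nearest mode, and no fast-math flags; the function names are hypothetical and this is not written with NumPy's universal intrinsics.

```c
#include <stdint.h>

/* Hypothetical helper: round x to the nearest integer (ties to even)
 * by adding and then subtracting a large constant. Floats in
 * [2^23, 2^24) have a spacing of exactly 1, so the addition discards
 * the fractional part. Only valid for |x| < 2^22. */
static float round_via_shift(float x)
{
    const float shift = 0x1.8p+23f;   /* 1.5 * 2^23 = 12582912 (assumed) */
    volatile float q = x + shift;     /* volatile blocks excess precision */
    return q - shift;                 /* n: integral value as a float */
}

/* Because n is integral and small, a plain conversion is exact. */
static int32_t round_to_int32_via_shift(float x)
{
    return (int32_t)round_via_shift(x);
}
```

With these assumptions, `round_via_shift(2.5f)` gives `2.0f` and `round_via_shift(3.5f)` gives `4.0f`, reflecting ties-to-even. Since the result of the subtraction is already integral, any float-to-int conversion, whatever its rounding mode, would return the same integer.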

@Mousius Mousius force-pushed the optimized-routines-tan-f32 branch 2 times, most recently from 4ab2bf4 to 2a4b880 Compare May 30, 2023 16:41
This builds on top of numpy#23603, and introduces a NumPy intrinsic variant for [float32 tan](https://github.com/ARM-software/optimized-routines/blob/master/pl/math/v_tanf_3u5.c), taken from Optimized Routines under MIT license.

@Mousius Mousius force-pushed the optimized-routines-tan-f32 branch from 2a4b880 to 61fb035 Compare May 30, 2023 20:07
@Mousius Mousius closed this Nov 29, 2023