ENH: float32 tan using NumPy intrinsics #23844
Closed
This builds on top of numpy#23399 and introduces a NumPy intrinsic variant for [float64 tan](https://github.com/ARM-software/optimized-routines/blob/master/pl/math/v_tan_3u5.c), taken from Optimized Routines under MIT license.

CPU features:
```
NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM
```

Relevant benchmarks:
```
       <main>        <optimized-routines-tan-f64>
+     82.1±0.03μs        103±0.2μs     1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'd')
+     81.6±0.03μs        102±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'd')
+     82.4±0.08μs        103±0.2μs     1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'd')
+      81.9±0.2μs        102±0.3μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'd')
+     81.5±0.03μs        101±0.4μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'd')
+     81.6±0.04μs       99.9±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'd')
+     82.8±0.05μs        101±0.4μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'd')
+     82.4±0.04μs        100±0.3μs     1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'd')
+      82.6±0.3μs       97.2±0.3μs     1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'd')
-       200±0.1μs       63.9±0.1μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'd')
-       199±0.1μs      63.4±0.07μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'd')
-       201±0.6μs      63.4±0.04μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'd')
-       200±0.1μs      63.2±0.03μs     0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'd')
-       200±0.1μs      59.4±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'd')
-       201±0.2μs      59.3±0.03μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'd')
-       200±0.2μs      57.7±0.03μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'd')
-       200±0.3μs      57.6±0.02μs     0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'd')
-       200±0.1μs      53.7±0.01μs     0.27  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')
```
Use the neg instruction if available, which seems safer; otherwise fall back to the previously created intrinsics, which are now available centrally.
* Removes npyv_cvt functions in favour of subtraction method
* Moves Estrin implementation to umath
* Minor text fixes to simd.h for negative.h
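For readers unfamiliar with the Estrin scheme mentioned above, the following scalar C sketch compares it with Horner's rule for a degree-7 polynomial. It is an illustration only: `horner7` and `estrin7` are names made up for this example and do not correspond to the helpers in umath, which operate on SIMD vectors. Estrin's scheme pairs adjacent coefficients and combines the pairs with powers of x squared, so the sub-expressions are independent of one another, whereas Horner's rule forms one long dependency chain.

```c
/* Horner's rule: c[0] + c[1]*x + ... + c[7]*x^7 as a single serial chain. */
static float horner7(const float c[8], float x) {
    float acc = c[7];
    for (int i = 6; i >= 0; --i) {
        acc = acc * x + c[i];
    }
    return acc;
}

/* Estrin's scheme for the same polynomial. */
static float estrin7(const float c[8], float x) {
    float x2 = x * x;
    float x4 = x2 * x2;

    /* Level 1: four independent linear terms. */
    float p01 = c[0] + c[1] * x;
    float p23 = c[2] + c[3] * x;
    float p45 = c[4] + c[5] * x;
    float p67 = c[6] + c[7] * x;

    /* Level 2: two independent cubic terms. */
    float p03 = p01 + p23 * x2;
    float p47 = p45 + p67 * x2;

    /* Level 3: combine the two halves. */
    return p03 + p47 * x4;
}
```

The two functions evaluate the same polynomial, but because the operations are grouped differently the floating-point results can differ in the last bits for general inputs.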
Mousius
commented
May 30, 2023
npyv_f32 n = npyv_sub_f32(q, shift);

/* n is representable as a signed integer, simply convert it. */
npyv_s32 in = npyv_rint_s32_f32(n);
Any ideas on how we could rewrite this one, @seiko2plus?
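For context on the snippet under review, the subtraction of `shift` is the second half of a common rounding trick: adding a large constant forces the sum to be rounded to an integer, and subtracting the constant recovers that integer as a float. The scalar C sketch below shows the idea under some assumptions that are mine and not stated in the PR: the constant is 1.5 * 2^23, the rounding mode is the default round-to-nearest-even, and |x| is below 2^22. The names `SHIFT_F32`, `round_via_shift`, and `round_to_int_via_shift` are hypothetical.

```c
#include <stdint.h>

/* Assumed constant: 1.5 * 2^23. For |x| < 2^22 the sum x + SHIFT_F32
 * lies in [2^23, 2^24), where consecutive floats are exactly 1 apart. */
#define SHIFT_F32 0x1.8p+23f

/* Round x to the nearest integer (ties to even), returned as a float. */
static float round_via_shift(float x) {
    /* volatile keeps the intermediate in float precision and stops the
     * compiler from folding (x + S) - S back into x. */
    volatile float q = x + SHIFT_F32;
    return q - SHIFT_F32;
}

/* The result is already integral, so the conversion is exact. */
static int32_t round_to_int_via_shift(float x) {
    return (int32_t)round_via_shift(x);
}
```

Because the value handed to the conversion is already an integer, the choice of float-to-int conversion mode (truncating or rounding) does not change the result in this range.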
4ab2bf4 to 2a4b880
This builds on top of numpy#23603, and introduces a NumPy intrinsic variant for [float32 tan](https://github.com/ARM-software/optimized-routines/blob/master/pl/math/v_tanf_3u5.c), taken from Optimized Routines under MIT license.

CPU features:
```
NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM
```

Relevant benchmarks:
```
+     45.3±0.07μs       73.0±0.1μs     1.61  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'f')
+     45.5±0.07μs       73.1±0.1μs     1.61  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'f')
+     45.1±0.05μs      68.2±0.02μs     1.51  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'f')
+     45.3±0.04μs      68.3±0.02μs     1.51  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'f')
+     46.4±0.09μs      65.3±0.04μs     1.41  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'f')
+     46.5±0.05μs      65.3±0.02μs     1.40  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'f')
+      45.4±0.1μs       63.3±0.2μs     1.40  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'f')
+     45.2±0.09μs       60.3±0.2μs     1.33  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'f')
+     46.3±0.02μs       57.1±0.2μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'f')
-      127±0.08μs      38.2±0.02μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'f')
-      127±0.05μs      38.3±0.01μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'f')
-      128±0.05μs      35.6±0.01μs     0.28  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'f')
-       129±0.3μs      35.6±0.01μs     0.28  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'f')
-      128±0.04μs      32.5±0.01μs     0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'f')
-       129±0.4μs         32.5±0μs     0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'f')
-      127±0.06μs      31.4±0.02μs     0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'f')
-      128±0.09μs      27.8±0.02μs     0.22  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'f')
-      128±0.05μs      24.8±0.01μs     0.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'f')
```
2a4b880 to 61fb035