ENH: float64 tan using Numpy intrinsics #23603
Conversation
This builds on top of #23399, so I expected sin/cos to show as changed, and it appears to have similar unrelated SIMD issues:
Force-pushed a838920 to 681ef62, then 681ef62 to 9429113.
@seiko2plus can you take a look at this one please? I've tried to put the intrinsic code centrally, but I see a lot of people have implemented them as part of the various operators, so I'm unsure what the strategy is 😸
Just a few changes; the implementation looks good to me. The only issue is that we're still going to have the SVML version when it's available (AVX512_SKX & Linux) along with your implementation, due to accuracy differences.
```diff
@@ -8,6 +8,10 @@
 #define NPY_SIMD_FMA3 1 // native support
 #define NPY_SIMD_BIGENDIAN 0
 #define NPY_SIMD_CMPSIGNAL 0
+#ifdef NPY_HAVE_AVX512DQ
+#define NPY_SIMD_CVT_F64 1
```
Suggested change:
```diff
 #define NPY_SIMD_CVT_F64 1
+#else
+#define NPY_SIMD_CVT_F64 0
```
```c
npyv_f64 q = npyv_sub_f64(npyv_muladd_f64(x, two_over_pi, shift), shift);
npyv_s64 qi = npyv_cvt_s64_f64(q);
```
Suggested change:
```diff
-npyv_f64 q = npyv_sub_f64(npyv_muladd_f64(x, two_over_pi, shift), shift);
-npyv_s64 qi = npyv_cvt_s64_f64(q);
+npyv_f64 xp_magic = npyv_muladd_f64(x, two_over_pi, shift);
+npyv_f64 q = npyv_sub_f64(xp_magic, shift);
+npyv_s64 qi = npyv_sub_s64(npyv_reinterpret_s64_f64(xp_magic), npyv_reinterpret_s64_f64(shift));
```
Wouldn't this lead to the same result? It's faster than integer truncation towards zero and, most importantly, supported by all archs.
numpy/core/src/common/simd/estrin.h (Outdated)
```diff
@@ -0,0 +1,24 @@
+/*
```
I don't think it's that common. I suggest moving them close to the context inside the (dispatch-able) source, or into a separate header inside the umath dir under the prefix simd_ rather than npyv_, in case you plan to share these macros with other sources.
Ah, I'll move it to umath, it's not common currently because most of the routines are written in asm for SVML. As we port more of the routines to universal intrinsics you'll see it used often: https://github.com/search?q=repo%3AARM-software%2Foptimized-routines%20estrin&type=code
```c
    0x1.7ea75d05b583ep-10, 0x1.289f22964a03cp-11, 0x1.4e4fd14147622p-12
};

npyv_u64 iax = npyv_and_u64(npyv_reinterpret_u64_f64(x), abs_mask);
```
Suggested change:
```diff
-npyv_u64 iax = npyv_and_u64(npyv_reinterpret_u64_f64(x), abs_mask);
+npyv_u64 iax = npyv_reinterpret_u64_f64(npyv_abs_f64(x));
```
maps to vabsq_f64 on Neon.
Thanks, I think I completely missed the obvious here 😸
I think it's fine except for npyv_trunc_s64_f64 (numpy/core/src/common/simd/neon/math.h, line 378 in bce6492).
LGTM. Given perf and accuracy numbers (below), I am not sure if we should retain the SVML implementation for AVX-512. It is a tiny bit faster but having the same implementation across all SIMD archs has its benefits too.
Accuracy:
SVML max ULP error is 3.93 and this one seems to be 3.48. The distributions of ULP errors for values (-1, 1) look to be the same as SVML:
ufunc | 0 ulp | 1 ulp | 2 ulp | 3 ulp
---|---|---|---|---
NumPy intrinsic tan | 61.29% | 37.86% | 0.85% | 0.0%
SVML tan | 59.80% | 39.15% | 1.02% | 0.02%
Perf (on AVX-512):
SVML implementation is a little bit faster:
```
       before           after         ratio
     [921a97e6]       [926ed9cd]
     <main>           <optimized-routines-tan-f64>
+ 184±2μs   245±0.2μs 1.33 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'd')
+ 146±0.1μs 192±0.3μs 1.32 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'd')
+ 205±0.7μs 259±1μs   1.26 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'd')
+ 209±1μs   240±3μs   1.15 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'd')
+ 261±3μs   292±1μs   1.12 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'd')
+ 182±0.4μs 202±0.8μs 1.11 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'd')
+ 185±1μs   202±1μs   1.09 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'd')
+ 144±4μs   153±0.1μs 1.06 bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')
```
```c
}
#endif
#if @fxnan@ == @len@
// Workaround, invalid value encountered when x is set to nan
```
Are you seeing an invalid raised for np.expm1(np.array([np.nan]))? I wonder why I am not seeing that locally on my SkylakeX.
Oh wait, never mind. Looks like that code was already there :) If there is no way around the function duplication, then I suggest you get rid of the #if @fxnan@ == @len@ section for simd_tan_f32 below; it seems unnecessary.
```c
 * #func = exp2, log2, log10, expm1, log1p, cbrt, tan, asin, acos, atan, sinh, cosh, asinh, acosh, atanh#
 * #default_val = 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0#
 * #fxnan = 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0#
 * #func = exp2, log2, log10, expm1, log1p, cbrt, asin, acos, atan, sinh, cosh, asinh, acosh, atanh#
```
I wonder if there is a way to avoid having two of these simd_@func@_@sfx@ functions. Seems like an exact copy of the code.
```c
 * #func = tan#
 * #intrin = tan#
 */
NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(@TYPE@_@func@)
```
Same here, looks a duplicate of the function below.
Argh, sorry, I should've put a bit more detail in the PR. We're going to go through all of the routines in https://github.com/ARM-software/optimized-routines, so this is a transient state until I raise the tan f32 PR, which moves that out of this file as well. I didn't want to do too much gluing together with a condition purely for tan, which I would then remove.
Makes sense. Maybe add a comment? And also get rid of the @fxnan@ section above.
Ah yes, a comment would be a good idea 😸 and potentially removing that bit will make the diff more sensible 🤔
Force-pushed e01dd8c to f1cb80e.
This builds on top of numpy#23399 and introduces a NumPy intrinsic variant for [float64 tan](https://github.com/ARM-software/optimized-routines/blob/master/pl/math/v_tan_3u5.c), taken from Optimized Routines under MIT license.

CPU features:
```
NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM
```

Relevant benchmarks:
```
     <main>           <optimized-routines-tan-f64>
+ 82.1±0.03μs  103±0.2μs   1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'd')
+ 81.6±0.03μs  102±0.2μs   1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'd')
+ 82.4±0.08μs  103±0.2μs   1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'd')
+ 81.9±0.2μs   102±0.3μs   1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'd')
+ 81.5±0.03μs  101±0.4μs   1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'd')
+ 81.6±0.04μs  99.9±0.4μs  1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'd')
+ 82.8±0.05μs  101±0.4μs   1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'd')
+ 82.4±0.04μs  100±0.3μs   1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'd')
+ 82.6±0.3μs   97.2±0.3μs  1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'd')
- 200±0.1μs    63.9±0.1μs  0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'd')
- 199±0.1μs    63.4±0.07μs 0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'd')
- 201±0.6μs    63.4±0.04μs 0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'd')
- 200±0.1μs    63.2±0.03μs 0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'd')
- 200±0.1μs    59.4±0.03μs 0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'd')
- 201±0.2μs    59.3±0.03μs 0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'd')
- 200±0.2μs    57.7±0.03μs 0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'd')
- 200±0.3μs    57.6±0.02μs 0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'd')
- 200±0.1μs    53.7±0.01μs 0.27  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')
```
Use the neg instruction if available, as that seems safer; else fall back to the previously created intrinsics, which are now available centrally.
* Removes npyv_cvt functions in favour of subtraction method
* Moves Estrin implementation to umath
* Minor text fixes to simd.h for negative.h
Force-pushed f1cb80e to 3c3abfd.
Thanks for the feedback @r-devulap and @seiko2plus. I've updated everything to reflect the conversation; please take another look when you get a chance 😸
This builds on top of numpy#23603, and introduces a NumPy intrinsic variant for [float32 tan](https://github.com/ARM-software/optimized-routines/blob/master/pl/math/v_tanf_3u5.c), taken from Optimized Routines under MIT license.

CPU features:
```
NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM
```

Relevant benchmarks:
```
+ 45.3±0.07μs 73.0±0.1μs  1.61  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'f')
+ 45.5±0.07μs 73.1±0.1μs  1.61  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'f')
+ 45.1±0.05μs 68.2±0.02μs 1.51  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'f')
+ 45.3±0.04μs 68.3±0.02μs 1.51  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'f')
+ 46.4±0.09μs 65.3±0.04μs 1.41  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'f')
+ 46.5±0.05μs 65.3±0.02μs 1.40  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'f')
+ 45.4±0.1μs  63.3±0.2μs  1.40  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'f')
+ 45.2±0.09μs 60.3±0.2μs  1.33  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'f')
+ 46.3±0.02μs 57.1±0.2μs  1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'f')
- 127±0.08μs  38.2±0.02μs 0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'f')
- 127±0.05μs  38.3±0.01μs 0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'f')
- 128±0.05μs  35.6±0.01μs 0.28  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'f')
- 129±0.3μs   35.6±0.01μs 0.28  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'f')
- 128±0.04μs  32.5±0.01μs 0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'f')
- 129±0.4μs   32.5±0μs    0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'f')
- 127±0.06μs  31.4±0.02μs 0.25  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'f')
- 128±0.09μs  27.8±0.02μs 0.22  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'f')
- 128±0.05μs  24.8±0.01μs 0.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'f')
```
This builds on top of #23399 and introduces a NumPy intrinsic variant for float64 tan, taken from Optimized Routines under MIT license.