ENH: float64 sin/cos using Numpy intrinsics #23399
Conversation
Force-pushed from bf46979 to c1bc90e (compare).
This takes the [sin](https://github.com/ARM-software/optimized-routines/blob/master/math/v_sin.c) and [cos](https://github.com/ARM-software/optimized-routines/blob/master/math/v_cos.c) algorithms from Optimized Routines, under MIT license, and converts them to Numpy intrinsics. The routines are within the ULP boundaries of other vectorised math routines (<4 ULP). They reduce performance in some special cases but improve it in the normal cases. Compared to the SVML implementation, these routines are more performant in the special cases, so we can safely assume the performance is acceptable for AArch64 as well.

| performance ratio (lower is better) | benchmark |
| ---- | ---- |
| 1.8 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'> 4 2 'd') |
| 1.79 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'> 4 4 'd') |
| 1.77 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'> 4 1 'd') |
| 1.74 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'> 2 2 'd') |
| 1.74 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'> 2 4 'd') |
| 1.72 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'> 2 1 'd') |
| 1.6 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'> 1 2 'd') |
| 1.6 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'> 1 4 'd') |
| 1.56 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'> 1 1 'd') |
| 1.42 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'> 2 2 'd') |
| 1.41 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'> 2 4 'd') |
| 1.37 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'> 2 1 'd') |
| 1.26 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'> 4 2 'd') |
| 1.26 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'> 4 4 'd') |
| 1.2 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'> 4 1 'd') |
| 1.18 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'> 1 2 'd') |
| 1.18 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'> 1 4 'd') |
| 1.12 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'> 1 1 'd') |
| 0.65 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'> 4 2 'd') |
| 0.64 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'> 2 4 'd') |
| 0.64 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'> 4 4 'd') |
| 0.64 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'> 2 2 'd') |
| 0.61 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'> 1 4 'd') |
| 0.61 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'> 1 2 'd') |
| 0.6 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'> 2 1 'd') |
| 0.6 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'> 4 1 'd') |
| 0.56 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'> 1 1 'd') |
| 0.52 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'> 4 2 'd') |
| 0.52 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'> 4 4 'd') |
| 0.52 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'> 2 4 'd') |
| 0.52 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'> 2 2 'd') |
| 0.47 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'> 1 4 'd') |
| 0.47 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'> 1 2 'd') |
| 0.46 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'> 4 1 'd') |
| 0.46 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'> 2 1 'd') |
| 0.42 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'> 1 1 'd') |

Co-authored-by: Pierre Blanchard <[email protected]>
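For readers skimming the conversation, the ported kernels all share one shape. Here is a heavily simplified sketch in NumPy's universal intrinsics (the constants are the usual two-word split of pi, the polynomial is a deliberately short Taylor-style stand-in for the real minimax one, and the function name is made up):

```c
/* Sketch of the kernel shape only, not the PR's code:
 * n = rint(x/pi); r = x - n*pi (multi-word); sin(r) via an odd
 * polynomial; flip the sign on lanes where n is odd. */
NPY_FINLINE npyv_f64
sketch_sin_f64(npyv_f64 x)
{
    const npyv_f64 inv_pi = npyv_setall_f64(0x1.45f306dc9c883p-2);
    const npyv_f64 pi_hi  = npyv_setall_f64(0x1.921fb54442d18p+1);
    const npyv_f64 pi_lo  = npyv_setall_f64(0x1.1a62633145c07p-53);
    npyv_f64 n = npyv_rint_f64(npyv_mul_f64(x, inv_pi));
    /* r = x - n*pi, subtracting the two-word pi piecewise with FMA */
    npyv_f64 r = npyv_nmuladd_f64(n, pi_hi, x);
    r = npyv_nmuladd_f64(n, pi_lo, r);
    /* short Taylor-style odd polynomial: r*(1 - r^2/6 + r^4/120);
     * the real kernel uses more, better-conditioned terms */
    npyv_f64 r2 = npyv_mul_f64(r, r);
    npyv_f64 p = npyv_muladd_f64(npyv_setall_f64(0x1.1111111111111p-7), r2,
                                 npyv_setall_f64(-0x1.5555555555555p-3));
    p = npyv_muladd_f64(p, r2, npyv_setall_f64(1.0));
    npyv_f64 y = npyv_mul_f64(p, r);
    /* odd n lands in a negative half-period of sin: flip the sign */
    npyv_f64 half_n = npyv_mul_f64(n, npyv_setall_f64(0.5));
    npyv_b64 odd = npyv_cmpneq_f64(npyv_rint_f64(half_n), half_n);
    return npyv_select_f64(odd, npyv_xor_f64(y, npyv_setall_f64(-0.0)), y);
}
```

The real routines add a third pi word, more polynomial terms, and the special-case lane handling discussed further down the thread.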
Force-pushed from c1bc90e to 3a11c0b (compare).
```c
 * #op = cos, sin#
 */
NPY_NOINLINE npyv_f64
simd_@op@_scalar_f64(npyv_f64 x, npyv_f64 y, npyv_b64 cmp)
```
This is more performant than allowing the compiler to inline the code?
Yip, surprisingly so. I didn't actually want to include it, but it helped a lot with the edge cases (IIRC, -0.2 to the ratio).
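For anyone following along, here's a minimal sketch of the pattern being discussed (names and the buffer/loop body are illustrative, not the PR's actual code): the cold special-case path sits behind a `NPY_NOINLINE` boundary so its register pressure and spills can't leak into the hot vector loop.

```c
/* Sketch only: recompute the flagged lanes with scalar calls behind a
 * NPY_NOINLINE boundary so the compiler keeps the hot loop lean. */
NPY_NOINLINE static npyv_f64
sketch_sin_special_f64(npyv_f64 x, npyv_f64 r, npyv_b64 special)
{
    double bx[npyv_nlanes_f64], br[npyv_nlanes_f64];
    npyv_store_f64(bx, x);
    npyv_store_f64(br, r);
    npy_uint64 bits = npyv_tobits_b64(special);
    for (int i = 0; i < npyv_nlanes_f64; ++i) {
        if ((bits >> i) & 1) {
            br[i] = npy_sin(bx[i]);  /* scalar fallback for this lane */
        }
    }
    return npyv_load_f64(br);
}
```

Keeping this out-of-line means the rarely taken path can't inflate the caller's stack frame or force extra vector spills on the common path.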
What CPU features were used when benchmarking? Were these the only changed benchmarks when running the ufunc benchmarks?
This relies less on compilers understanding how to create these operations.
@mattip, this is using ASIMD on AArch64, just this branch on top of …

Command: …

Features: …

Full results (I filtered these as they look entirely unrelated yet still changed): …
Hopefully this works better in the 32-bit builds
Force-pushed from 9a5eb01 to 8fae1b4 (compare).
@mattip I'm a bit stumped by the 32-bit Windows failures; are there any guides on how to reproduce them easily?
The 32-bit Windows failures are strange. @seiko2plus could you take a look?
Did you check benchmarks on a machine that uses SVML?
From a quick look, I guess the problem is due to not force-inlining the following function: numpy/numpy/core/src/umath/loops_trigonometric.dispatch.c.src, lines 47 to 50 in 8fae1b4
Mainly because of stack alignment: spilling a register wider than 128 bits onto a stack with only 128-bit alignment could cause these strange errors.
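If I understand the diagnosis, the fix would be roughly this shape (a sketch assuming NumPy's `NPY_FINLINE` macro; the function and argument names here are illustrative stand-ins for the helper in `loops_trigonometric.dispatch.c.src`):

```c
/* Sketch: force-inline the small range-reduction helper so no vector
 * register wider than the 32-bit ABI's stack alignment ever has to be
 * spilled across a call boundary. */
NPY_FINLINE npyv_f64
sketch_range_reduction_f64(npyv_f64 x, npyv_f64 n,
                           npyv_f64 c1, npyv_f64 c2, npyv_f64 c3)
{
    /* r = x - n*(c1 + c2 + c3), accumulated term by term with FMA */
    npyv_f64 r = npyv_nmuladd_f64(n, c1, x);
    r = npyv_nmuladd_f64(n, c2, r);
    r = npyv_nmuladd_f64(n, c3, r);
    return r;
}
```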
This is to test the theory that noinline is causing stack spilling that collides with alignment in weird ways.
Force-pushed from b4fb0ff to ca33625 (compare).
Yip, it was one of the newer cloud instances.
Ahhh, I see other references to similar behaviour elsewhere; I've added some logic to mitigate it - thanks for the guidance!
Documented better here: numpy#18330 (comment)
Force-pushed from b5f7d21 to f5fda54 (compare).
Could you add the results to the PR, including the output of …

@seiko2plus could you review?
A bit from the sidelines, but it seems that non-trig functions, for which this PR should be irrelevant, become quite a bit slower (like …
LGTM, just minor changes.
```c
if (npyv_any_b64(cmp)) {
    /* If fenv exceptions are to be triggered correctly, set any special lanes
       to 1 (which is neutral w.r.t. fenv). These lanes will be fixed by
       scalar loop later. */
    r = npyv_select_f64(cmp, npyv_setall_f64(1.0), r);
}
```
Suggested change:

```c
/* If fenv exceptions are to be triggered correctly, set any special lanes
   to 1 (which is neutral w.r.t. fenv). These lanes will be fixed by
   scalar loop later. */
r = npyv_select_f64(cmp, npyv_setall_f64(1.0), r);
```

(i.e. drop the `npyv_any_b64` branch and blend unconditionally)
Typo? This branch costs way more than a single blend instruction.
Ahh, this comes from https://github.com/ARM-software/optimized-routines/blob/master/math/v_sin.c#L69, where originally it was way more optimised for the happy path - will update!
```diff
@@ -2,19 +2,25 @@
 ** $maxopt baseline
 ** (avx2 fma3) avx512f
 ** vsx2 vsx3 vsx4
-** neon_vfpv4
+** asimd
```
Suggested change:

```diff
-** asimd
+** neon_vfpv4
```

`neon_vfpv4` already implies `asimd` on aarch64, since it's part of the hardware baseline, so there's no need for this change.
```c
    }

    if (npyv_any_b64(cmp)) {
        out = simd_@op@_scalar_f64(x_in, out, cmp);
```
Suggested change:

```c
out = npyv_select_f64(cmp, x_in, out);
out = simd_@op@_scalar_f64(out, npyv_tobits_b64(cmp));
```

(replacing `out = simd_@op@_scalar_f64(x_in, out, cmp);`)
You can pass just one vector instead, which is supposed to perform better.
What architecture is that for? On AArch64, function calls can use the vector registers directly (Ref), so it'll end up with the same overhead as far as I know.
Agreed, but I meant that using one vector avoids the extra element access, and that passing a bitfield instead of the boolean vector should save at least one op, due to the nature of `any`/`all` and `tobits`; e.g. on x86 (SSE/AVX) they use `movemask`.
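If I read the suggestion right, the loop body would end up something like this (a sketch; the one-vector-plus-bitfield signature for the helper is hypothetical here):

```c
/* Blend special lanes back to their raw inputs, then pass one vector
 * plus a lane bitfield. npyv_tobits_b64 lowers to a single movemask on
 * x86 SSE/AVX, and the bitfield doubles as the "any special lane?" test. */
npy_uint64 bits = npyv_tobits_b64(cmp);
if (bits != 0) {
    out = npyv_select_f64(cmp, x_in, out);
    out = simd_sin_scalar_f64(out, bits);  /* hypothetical new signature */
}
```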
This implementation is very close to SVML; for large arguments SVML uses Payne-Hanek style reduction, while this PR falls back to libm.
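For concreteness, the large-argument policy looks roughly like this (a sketch; the cutoff constant is illustrative and the PR's actual threshold and lane handling may differ):

```c
/* Flag lanes needing the slow path: |x| beyond the fast reduction's
 * valid range, or NaN. NaN <= limit evaluates false, so negating the
 * cmple catches NaN lanes as well. */
npyv_b64 special = npyv_not_b64(
    npyv_cmple_f64(npyv_abs_f64(x), npyv_setall_f64(0x1p23)));
if (npyv_any_b64(special)) {
    /* the scalar fallback (sketched earlier in the thread) calls libm,
     * which performs the full Payne-Hanek reduction for these lanes */
    out = sketch_sin_special_f64(x, out, special);
}
```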
@Mousius I forgot, but if you would like, you can follow up by adding a brief towncrier fragment for the performance enhancement.
This builds on top of numpy#23399 and introduces a NumPy intrinsic variant for [float64 tan](https://github.com/ARM-software/optimized-routines/blob/master/pl/math/v_tan_3u5.c), taken from Optimized Routines under MIT license.

CPU features:
```
NumPy CPU features: NEON NEON_FP16 NEON_VFPV4 ASIMD ASIMDHP* ASIMDDP* ASIMDFHM
```

Relevant benchmarks:
```
        <main>   <optimized-routines-tan-f64>
+  82.1±0.03μs    103±0.2μs  1.26  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 4, 'd')
+  81.6±0.03μs    102±0.2μs  1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 2, 'd')
+  82.4±0.08μs    103±0.2μs  1.25  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 4, 'd')
+   81.9±0.2μs    102±0.3μs  1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 2, 'd')
+  81.5±0.03μs    101±0.4μs  1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 2, 1, 'd')
+  81.6±0.04μs   99.9±0.4μs  1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 4, 1, 'd')
+  82.8±0.05μs    101±0.4μs  1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 4, 'd')
+  82.4±0.04μs    100±0.3μs  1.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 2, 'd')
+   82.6±0.3μs   97.2±0.3μs  1.18  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'd')
-    200±0.1μs   63.9±0.1μs  0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 4, 'd')
-    199±0.1μs  63.4±0.07μs  0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 2, 'd')
-    201±0.6μs  63.4±0.04μs  0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 4, 'd')
-    200±0.1μs  63.2±0.03μs  0.32  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 2, 'd')
-    200±0.1μs  59.4±0.03μs  0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 4, 'd')
-    201±0.2μs  59.3±0.03μs  0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 2, 'd')
-    200±0.2μs  57.7±0.03μs  0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 4, 1, 'd')
-    200±0.3μs  57.6±0.02μs  0.29  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 2, 1, 'd')
-    200±0.1μs  53.7±0.01μs  0.27  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'd')
```
Just a note. It is probably fine, but some downstream packages are noticing the difference in precision in their test suites (which surprised me slightly, because I thought it would be the same as SVML, and presumably that had been used in the test suites before).
Thanks for the heads up @seberg, I also believe this is fine; @PierreBlanchard noted that in reference to the … Having said that, I'll keep an eye on the ones that are linking back to this PR; are there others I should be aware of?
I suppose it's linked above already, but in scipy/scipy#18382, …
Just to add that I also found it a bit surprising that …
You're probably aware of this, but that is a correct result: … is the correct double precision result, even taking the inexactness of … into account.

What we're seeing now, I think, is that we're being caught off guard because the errors, while within the stated bounds, are occurring in a "surprising" location where we haven't seen them before (namely around …). Although I'm familiar with all the usual precautions about floating point calculations, I'm still guilty of writing code that expected …
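The sin(pi) point is easy to check; here is a small standalone demonstration (assuming a correctly rounded libm, such as recent glibc):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* M_PI is the double nearest pi, falling short of the true pi by
       about 1.2246e-16. Since sin(pi - e) ~= e for tiny e, the
       correctly rounded sin(M_PI) is that residual, not 0.0. */
    printf("sin(M_PI)       = %.17g\n", sin(M_PI));  /* ~1.2246467991473532e-16 */
    /* Near pi/2 the derivative of sin is ~0, so input rounding barely
       moves the output; deviations from 1 there come from the vector
       approximation itself, which is why sin(pi/2) != 1 surprises. */
    printf("1 - sin(M_PI/2) = %.17g\n", 1.0 - sin(M_PI / 2));
    return 0;
}
```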
Yip, this aligns with SVML and libmvec as the current expectation for vector math routines. Is there a way to ask the function to dispatch a non-vectorised variant that does the scalar loop instead? That would at least allow users to trade speed for accuracy themselves.
How do they compare in practice in terms of accuracy here? I am somewhat curious whether any of the libs considers that results should (also) be correct in terms of inputs rather than output accuracy, i.e. bounded by …
It was not formalised like this in the optimized-routines implementations, as only forward error was considered (not backward errors). But reported errors include the rounding error for the input (+0.5 ulp in round-to-nearest), as do the NumPy tests, I believe.
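In other words (my reading of the remark above), the quoted bound is a forward error in ulp of the exact result at the already-rounded input, with the first-order effect of input rounding folded into the budget:

```latex
\[
\operatorname{err}(x) = \frac{\bigl|\widehat{f}(x) - f(x)\bigr|}{\operatorname{ulp}\bigl(f(x)\bigr)} \le 4,
\qquad
\bigl|f(\operatorname{fl}(x)) - f(x)\bigr| \approx |f'(x)| \cdot \underbrace{\bigl|\operatorname{fl}(x) - x\bigr|}_{\le\, 0.5\,\operatorname{ulp}(x)} .
\]
```

For sin that means input rounding dominates near pi (where |cos| is close to 1), while near pi/2 the derivative vanishes, so the deviation from 1 there is pure approximation error rather than the rounded input showing through.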
It is great to get so much feedback so quickly on this type of design issue. If NumPy expects this type of behaviour then there is certainly a way to fix it, but I thought I should add a bit of context. I very much agree that it would be nice to have (as it seems to matter for many users). Unfortunately, as you guessed, enforcing it comes at a large cost in performance in most cases. I have not looked into it, but there might be a way to minimise this cost in this specific case. However, in general it is not something that should be expected from efficient vector implementations of math routines. This type of consideration is usually left to the user, as each user might have different expectations about the number of cases that have simple/known outputs. Users might also have cheaper options for simplifying their code to take that into account. If not, then falling back to a scalar implementation is probably the way to go.
The question is whether the current thing is a safe default for the vast majority of users. Unfortunately, we have to choose for the user; we don't have good user control over it in NumPy. (Yes, you can at compile time, and you can also disable most vectorizations using an env variable.) So if this is a bad default, that is a problem. We could expose it maybe, but without new API it would be pretty useless in practice. Having better API might also allow us to be a bit more relaxed about a more inaccurate default. I don't want to argue about the 4 ULP choice. But maybe a question is if/what other practical constraints there are? (E.g. I expect you ensure that the codomain is correct, as in …)
@pllim also, how weird are the test adjustments in astropy? We had similar fallout with SVML, but I am not sure it was quite as surprising. Below are the plots I get without SVML, with SVML, and with the new code for sin, plus a zoom in around …

EDIT: So basically SVML is just as bad and jitters around pi/2 also. On average maybe a tiny bit less, and for …
A writeup of how the change cascades in mpl's tests is provided here: matplotlib/matplotlib#25789 (comment). The tl;dr is that …
Re: #23399 (comment)
Are our test updates to be more tolerant, or to avoid inherently unstable/sensitive floating point operations …
Right now, I am operating under the assumption that these are really all somewhat annoying but OK test upgrades, and that while it might be nice to tweak things to be slightly more precise, it is hopefully OK as is. But if there are things that break in worse ways than just tests (or seem likely to do so in user code), I still think we should consider dealing with it, or checking whether a small precision bump (via better polynomial factors) helps.
For astropy at least, some of the tests I wrote for spherical-to-cartesian conversion break just because sin(pi/2) is no longer exactly equal to unity. Obviously, it's all adjustable, but it does seem that a polynomial conditioned on actually spanning the full -1 to 1 inclusive range would be nicer...