ENH: Convert fp32 sin/cos from C universal intrinsics to C++ using Highway #25781
Conversation
}
opmask_t nnan_mask = hn::Not(hn::IsNaN(x_in));
// Eliminate NaN to avoid FP invalid exception
x_in = hn::IfThenElse(nnan_mask, x_in, zerosf);
This used to be wrapped in `#if NPY_SIMD_CMPSIGNAL`, which is 0 on AVX512 and AVX2.
Yeah, I had test failures which got resolved when I got rid of it. Need to figure that out.
    }
}
if (simd_maski != (npy_uint64)((1 << lanes) - 1)) {
    float ip_fback[hn::Lanes(f32)];
Unsure if the benchmarks hit this case often, but it'd be worth checking this compiles to a vector without the alignment attributes (I think the Highway thing here would be `HWY_ALIGN`).
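A minimal sketch of that change, assuming `f32` is the tag from this diff and `ip` is a hypothetical vector being spilled to the buffer:

```cpp
// HWY_ALIGN aligns the buffer so the aligned Store variant can be used;
// hn::MaxLanes(f32) is a compile-time upper bound suitable for an array size.
HWY_ALIGN float ip_fback[hn::MaxLanes(f32)];
hn::Store(ip, f32, ip_fback);
```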
}

NPY_FINLINE vec_f32
GatherIndexN(const float* src, npy_intp ssrc, npy_intp len)
I'm not massively familiar with x86, but it looks like `npyv_loadn_tillz_s32` uses a gather instruction here rather than a loop; worth trying that to see if it provides similar performance?
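For reference, a hedged sketch of what a Highway gather could look like here, assuming `f32` and `ssrc` from the surrounding diff and ignoring the tail masking that `GatherIndexN` also handles:

```cpp
// Build indices 0, ssrc, 2*ssrc, ... and gather one element per lane.
const hn::RebindToSigned<decltype(f32)> di32;
const auto indices = hn::Mul(hn::Iota(di32, 0), hn::Set(di32, static_cast<int32_t>(ssrc)));
vec_f32 v = hn::GatherIndex(f32, src, indices);
```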
The 1.5x slowdown actually happens in the non-strided case, which doesn't use Gather or Scatter. But even for the strided case, using gather is slower than the scalar method (since the Downfall CVE).
Ahh, apologies, I wasn't sure whether the code generated for Gather was that slow on x86 as well; it doesn't look as good as the ASIMD `npyv_` function 😅 Too many architectures 😸
Ok, so I played a bit of spot-the-difference and left some comments, and I ran some benchmarks quickly. It looks like with ASIMD there are regressions with these:
This indicates the gather/scatter aren't as optimal as the NumPy ones; I wonder if we can blend the NumPy loads and stores with the Highway code here 🤔
meson_cpu/x86/meson.build (outdated)
@@ -60,7 +60,7 @@ FMA3 = mod_features.new(
     test_code: files(source_root + '/numpy/distutils/checks/cpu_fma3.c')[0]
 )
 AVX2 = mod_features.new(
-    'AVX2', 25, implies: F16C, args: '-mavx2',
+    'AVX2', 25, implies: F16C, args: '-march=skylake',
I wonder if this is more related to implying `-mtune=skylake`? Highway relies on the compiler to do some optimisations, and I do not know what `-mtune=generic` does with just `-mavx2` 🤔
Just looked at this a bit more deeply, and it looks like with just `-mavx2` you don't get HWY_AVX2; these flags worked to get HWY_AVX2:
`-mpclmul -maes -mavx -mavx2 -mbmi -mbmi2 -mfma -mf16c`
I assume they're all implied by `haswell`, but `avx2` alone is actually far more limiting. Is there a processor which supports `avx2` without all these things? 🤔
Yet the surprising thing is that using `-march=haswell` makes the performance a lot worse (nearly 4x slower). I need some input from @jan-wassenberg to see what is happening here.
You are exactly right; we require more flags than just `-march=haswell`: add `-maes`, and then it is sufficient for HWY_AVX2.
Unfortunately there are a very few Haswells without AES, so we do check for that.
Out of curiosity: why do you need `-maes` for HWY_AVX2 if you aren't using any AES-related instructions?
AESRound is a supported Highway op; we do not know whether users will use it. I suppose the options are to detect at runtime (but that would add considerable overhead), or add another target for AVX2 \ AES, but this is very rare, so not worthwhile, right?
That is extremely likely. My Gather/Scatter were just a quick and dirty way to make this work. I will eventually move to using the Highway implementation.
Please explain to me where this conclusion came from. The current speed-up is related to special cases (the libm fallback has been improved somehow), which affects both contiguous and non-contiguous. So I think maybe the regression is related to that.
@seiko2plus I am seeing the slowdown for strided cases as well. I just meant this could be a result of my quick and dirty Gather/Scatter implementation.
Right, Highway checks for multiple CPU flags before using AVX3. `-march=skx` is sufficient here, but for the individual `-mavx512` flags, that would require a long list.
 * these numbers
 */
if (!hn::AllFalse(f32, simd_mask)) {
    vec_f32 x = hn::IfThenElse(hn::And(nnan_mask, simd_mask), x_in, zerosf);
Could also use `IfThenElseZero` here and above if you like.
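A hedged sketch of that suggestion applied to the quoted line:

```cpp
// IfThenElseZero(mask, v) keeps v where the mask is set and zeroes the other
// lanes, avoiding the explicit zerosf operand.
vec_f32 x = hn::IfThenElseZero(hn::And(nnan_mask, simd_mask), x_in);
```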
cos = hn::IfThenElse(sine_mask, sin, cos);

// multiply by -1 for appropriate elements
opmask_t negate_mask = hn::RebindMask(f32, hn::Eq(hn::And(iquadrant, twos), twos));
Is `IfNegativeThenNegOrUndefIfZero` faster than these two lines?
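A hedged sketch of what that could look like. The shift count of 30 assumes 32-bit lanes, and whether lanes with a zero mask are left intact would need verifying before adopting this, since the op's name reserves the zero case as undefined:

```cpp
// Move the quadrant bit (value 2) into the sign-bit position, then flip the
// sign of cos wherever that lane's sign bit is set.
vec_f32 sign_bits = hn::BitCast(f32, hn::ShiftLeft<30>(hn::And(iquadrant, twos)));
cos = hn::IfNegativeThenNegOrUndefIfZero(sign_bits, cos);
```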
        ScatterIndexN(cos, dst, sdst, len);
    }
}
if (simd_maski != (npy_uint64)((1 << lanes) - 1)) {
StoreMaskBits is not necessarily cheap. Recommend using CountTrue to get the count instead.
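A hedged sketch of the suggested check, assuming `lanes` holds `hn::Lanes(f32)` as in the diff:

```cpp
// CountTrue only returns the number of set lanes, which is enough to decide
// whether any lane needs the fallback path.
if (hn::CountTrue(f32, simd_mask) != lanes) {
    // at least one lane fell outside the fast range
}
```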
I am now using `if (!hn::AllTrue(f32, simd_mask))` instead and moved the `StoreMaskBits` inside the if condition (compute only when required). I am afraid `CountTrue` won't be sufficient here; we also need to know which specific lanes have their bit set to 1.
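A hedged sketch of that restructuring, with names matching the surrounding diff:

```cpp
// The bitmask is only materialized when at least one lane failed the range
// check, so the common all-lanes-valid case skips StoreMaskBits entirely.
if (!hn::AllTrue(f32, simd_mask)) {
    uint8_t mask_bits[(hn::MaxLanes(f32) + 7) / 8];
    hn::StoreMaskBits(f32, simd_mask, mask_bits);  // which lanes need the fallback
    // ... scalar fallback for lanes whose bit is 0 ...
}
```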
for (; len > 0; len -= lanes, src += ssrc*lanes, dst += sdst*lanes) {
    vec_f32 x_in;
    if (ssrc == 1) {
LoadN should only be used in the tail of a loop. It is probably worthwhile replicating the body of the loop (e.g. by moving it to a helper function), with most iterations using a normal load, and only the last iteration using LoadN.
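A hedged sketch of that structure; `Compute` and `Apply` are hypothetical stand-ins for this PR's sin/cos body and its contiguous loop:

```cpp
#include <cstddef>
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// Placeholder for the vectorized sin/cos body.
template <class V>
V Compute(V x) { return x; }

HWY_ATTR void Apply(const float* HWY_RESTRICT src, float* HWY_RESTRICT dst,
                    size_t len) {
    const hn::ScalableTag<float> f32;
    const size_t N = hn::Lanes(f32);
    size_t i = 0;
    for (; i + N <= len; i += N) {  // main loop: full, unmasked loads/stores
        hn::StoreU(Compute(hn::LoadU(f32, src + i)), f32, dst + i);
    }
    if (i < len) {  // tail: the only place LoadN/StoreN run
        const size_t remaining = len - i;
        hn::StoreN(Compute(hn::LoadN(f32, src + i, remaining)), f32, dst + i,
                   remaining);
    }
}
```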
} else {
    x_in = GatherIndexN(src, ssrc, len);
}
opmask_t nnan_mask = hn::Not(hn::IsNaN(x_in));
We can avoid the `Not` here by swapping the order of args to the subsequent `IfThenElse` calls that use this, and also replacing `And(nnan_mask, ..)` with `AndNot(nan_mask, ..)`.
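A hedged sketch of those two changes on the quoted lines:

```cpp
opmask_t nan_mask = hn::IsNaN(x_in);            // no Not needed
x_in = hn::IfThenElse(nan_mask, zerosf, x_in);  // swapped argument order
// later: hn::AndNot(nan_mask, simd_mask) is equivalent to And(nnan_mask, simd_mask)
```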
cos = hn::IfThenElse(nnan_mask, cos, hn::Set(f32, NPY_NANF));

if (sdst == 1) {
    hn::StoreN(cos, f32, dst, len);
As with LoadN, StoreN should only be called in the last iteration.
        ScatterIndexN(cos, dst, sdst, len);
    }
}
if (!hn::AllTrue(f32, simd_mask)) {
Consider HWY_UNLIKELY annotation?
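That is, a minimal sketch:

```cpp
// Hint to the compiler that the special-case fallback is rarely taken.
if (HWY_UNLIKELY(!hn::AllTrue(f32, simd_mask))) {
    // ... fallback path ...
}
```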
Cool, I have added some additional suggestions :)
@jan-wassenberg just looking at the CI failures, and it seems there's some compiler incompatibility on PPC (https://github.com/numpy/numpy/actions/runs/8087779139/job/22100478961?pr=25781#step:8:504) and an abort on Z13 (https://github.com/numpy/numpy/actions/runs/8087779139/job/22100479504?pr=25781). Any ideas? There's also a failure on
hm, the Z13 error is
The latter at least I can help with. We are missing HWY_ATTR:
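A hedged illustration of the fix (the function name here is hypothetical): any function that calls Highway ops needs the attribute so the compiler may emit the target's instructions for it.

```cpp
// Without HWY_ATTR, the compiler may refuse to generate target-specific
// instructions inside the function body on some platforms.
HWY_ATTR void SinCosF32(const float* HWY_RESTRICT src,
                        float* HWY_RESTRICT dst, size_t len) {
    // ... Highway ops ...
}
```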
Adding HWY_ATTR fixed the build errors on ppc64le. Why did it fail only for this platform, though? Need help with debugging 3 more failures:
hm, strange. Neither the x86 implementation of TableLookupBytesOr0, nor the quoted line and the one before it, have a numeric constant. Which compiler is Cygwin using?
From the logs:
hm, here we are able to compile StoreInterleaved3 using GCC 11.4. Are you able to repro the issue in godbolt?
LGTM! 🥳
Huge milestone, thanks @r-devulap!
This was a big lift, thank you all and congrats on getting it over the finish line!
@r-devulap, it's great to see progress on moving to Highway! I noticed that you've disabled dispatching for the CPU features of ppc64 and zSystem, which is not going to prevent baseline capabilities. Have you opened any issues on Highway or NumPy to report this?
@seiko2plus good to hear from you again!
Good point, we haven't. Perhaps Highway is the right place? We aren't sure where the bug is coming from, though.
hm, what's the issue we were addressing by removing the VSX etc.?
@jan-wassenberg IIRC, it was a segfault. We should see the failure in #27627
For clarification, SIMD optimizations for sine and cosine functions on both ppc64 and z/Architecture (IBM Z) were disabled by gh-25781 to bypass CI tests. This PR aims to re-enable optimizations for z/Architecture after addressing the following runtime errors, while gh-27627 re-enabled ppc64 optimizations.

* Re-enable VXE for sin/cos HWY implementation

Co-authored-by: Sayed Adel <[email protected]>
This patch is to experiment with Highway and see how we can leverage its intrinsics using static dispatch. I would think these are the minimum requirements:
On x86, both AVX-512 and AVX2 seem to have performance regressions. I have yet to figure out where they are coming from.
AVX-512 benchmarks

These are about 1.5x slower even when built with `-march=skylake-avx512`. If we use just `-mavx512f -mavx512bw`, etc., then it's about 4x slower.

AVX2 benchmarks

These are about 1.34x slower when built using `-march=skylake`. If we use `-mavx2` or even `-march=haswell`, then these seem to be 4x slower.