
ENH: Convert fp32 sin/cos from C universal intrinsics to C++ using Highway #25781


Merged
14 commits merged into numpy:main on Jun 26, 2024

Conversation

@r-devulap (Member)

This patch is to experiment with Highway and see how we can leverage its intrinsics using static dispatch. I would think these are the minimum requirements:

  • Passes CI on all the relevant platforms: AVX512_SKX, [AVX2, FMA3], VSX4, VSX3, VSX2, NEON_VFPV4, VXE2, VX. All tests pass on my local AVX512 and AVX2 machines.
  • No performance regressions.

On x86, both AVX-512 and AVX2 show performance regressions; I have yet to figure out where they are coming from.

AVX-512 benchmarks

These are about 1.5x slower even when built with -march=skylake-avx512. If we use just -mavx512f -mavx512bw, etc., then it's about 4x slower.

| Change   | Before [5867ee6b] <main>   | After [d5596c17] <sincos-hwy>   |   Ratio | Benchmark (Parameter)                                                   |
|----------|----------------------------|---------------------------------|---------|-------------------------------------------------------------------------|
| +        | 7.48±0μs                   | 11.5±0.01μs                     |    1.53 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')        |
| +        | 7.70±0.06μs                | 11.5±0.02μs                     |    1.49 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')        |
| +        | 11.8±0.1μs                 | 16.3±0.04μs                     |    1.39 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')        |
| +        | 11.9±0.1μs                 | 16.3±0.09μs                     |    1.38 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f')        |
| +        | 23.2±0.06μs                | 28.0±0.01μs                     |    1.21 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')        |
| +        | 23.3±0.2μs                 | 28.1±0.01μs                     |    1.21 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')        |
| +        | 20.4±0.02μs                | 24.4±0.09μs                     |    1.2  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f')        |
| +        | 20.3±0.01μs                | 24.5±0.02μs                     |    1.2  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f')        |
| -        | 66.1±0.02μs                | 62.2±0.07μs                     |    0.94 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f') |
| -        | 67.2±0.01μs                | 62.2±0.03μs                     |    0.93 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f') |
| -        | 64.6±0.03μs                | 59.1±0.01μs                     |    0.91 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 1, 'f') |
| -        | 65.2±0.02μs                | 59.1±0.04μs                     |    0.91 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 1, 'f') |
| -        | 60.6±0.04μs                | 54.4±0.02μs                     |    0.9  | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 2, 'f') |
| -        | 59.0±0.05μs                | 51.6±0.04μs                     |    0.88 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'f') |
| -        | 61.5±0.03μs                | 54.1±0.06μs                     |    0.88 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 2, 'f') |
| -        | 59.9±0.01μs                | 51.8±0.05μs                     |    0.86 | bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'f') |

AVX2 benchmarks

These are about 1.34x slower when built using -march=skylake. If we use -mavx2 or even -march=haswell, then these seem to be 4x slower.

| Change   | Before [5867ee6b] <main>   | After [d5596c17] <sincos-hwy>   |   Ratio | Benchmark (Parameter)                                            |
|----------|----------------------------|---------------------------------|---------|------------------------------------------------------------------|
| +        | 12.7±0.1μs                 | 17.1±0.1μs                      |    1.34 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f') |
| +        | 13.1±0.01μs                | 17.2±0.2μs                      |    1.31 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f') |
| +        | 40.5±0.02μs                | 49.7±0.2μs                      |    1.23 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f') |
| +        | 36.7±0.2μs                 | 45.1±0.2μs                      |    1.23 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 1, 'f') |
| +        | 37.2±0.04μs                | 45.2±0.4μs                      |    1.22 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 1, 'f') |
| +        | 40.3±0.01μs                | 49.3±0.2μs                      |    1.22 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f') |
| +        | 18.5±0.1μs                 | 22.2±0.2μs                      |    1.2  | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f') |
| +        | 18.5±0.02μs                | 21.8±0.2μs                      |    1.18 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f') |

@Mousius (Member) commented Feb 7, 2024

cc @jan-wassenberg

}
opmask_t nnan_mask = hn::Not(hn::IsNaN(x_in));
// Eliminate NaN to avoid FP invalid exception
x_in = hn::IfThenElse(nnan_mask, x_in, zerosf);
Member

This used to be wrapped in #if NPY_SIMD_CMPSIGNAL, which is 0 on AVX512 and AVX2.

Member Author

Yeah, I had test failures that were resolved when I got rid of it. I need to figure that out.

}
}
if (simd_maski != (npy_uint64)((1 << lanes) - 1)) {
float ip_fback[hn::Lanes(f32)];
@Mousius (Member), Feb 7, 2024
Unsure if the benchmarks hit this case often, but it'd be worth checking that this compiles to a vector without the alignment attributes (I think the Highway equivalent here would be HWY_ALIGN; see the sketch below).
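
A minimal sketch of the suggestion, assuming the PR's `f32` tag and the non-scalable targets NumPy dispatches to (where `hn::MaxLanes` is a compile-time constant):

```cpp
// HWY_ALIGN aligns the fallback buffer to the vector width, so the
// compiler can use aligned vector loads/stores instead of per-element ones.
HWY_ALIGN float ip_fback[hn::MaxLanes(f32)];
hn::Store(x_in, f32, ip_fback);  // spill lanes to the aligned buffer
```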

}

NPY_FINLINE vec_f32
GatherIndexN(const float* src, npy_intp ssrc, npy_intp len)
Member

I'm not massively familiar with x86, but it looks like npyv_loadn_tillz_s32 uses a gather instruction here rather than a loop; it might be worth trying that to see if it provides similar performance (a sketch follows).
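
For reference, a hedged sketch of what a Highway hardware gather could look like here; `src`, `ssrc`, `vec_f32`, and the `f32` tag are taken from the surrounding diff, and this is not necessarily the code the PR should use:

```cpp
// Build per-lane indices 0, ssrc, 2*ssrc, ... and gather in one op;
// on AVX2/AVX-512 this can compile to vgatherdps.
const hn::RebindToSigned<decltype(f32)> s32;
auto idx = hn::Mul(hn::Iota(s32, 0), hn::Set(s32, static_cast<int32_t>(ssrc)));
vec_f32 v = hn::GatherIndex(f32, src, idx);
```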

Member Author

The 1.5x slowdown actually happens in the non-strided case, which doesn't use Gather or Scatter. But even for the strided case, using gather is slower than the scalar method (ever since the Downfall CVE mitigations).

Member

Ahh, apologies, I wasn't sure whether the code generated for Gather was slow on x86 as well; it doesn't look as good as the ASIMD npyv_ function 😅 Too many architectures 😸

@Mousius (Member) commented Feb 7, 2024

Ok, so I played a bit of spot-the-difference and left some comments, and I quickly ran some benchmarks - it looks like with ASIMD there are regressions in these:

bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 4, 2, 'f')
bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 4, 2, 'f')
bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 4, 2, 'f')
bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 4, 2, 'f')
bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')

This indicates the gather/scatter aren't as optimal as the NumPy ones; I wonder if we can blend the NumPy loads and stores with the Highway code here 🤔

@@ -60,7 +60,7 @@ FMA3 = mod_features.new(
   test_code: files(source_root + '/numpy/distutils/checks/cpu_fma3.c')[0]
 )
 AVX2 = mod_features.new(
-  'AVX2', 25, implies: F16C, args: '-mavx2',
+  'AVX2', 25, implies: F16C, args: '-march=skylake',
Member

I wonder if this is more related to implying -mtune=skylake? Highway relies on the compiler to do some optimisations, and I do not know what -mtune=generic does with just -mavx2 🤔

Member

Just looked at this a bit more deeply, and it looks like with just -mavx2 you don't get HWY_AVX2; these flags worked to get HWY_AVX2:

-mpclmul -maes -mavx -mavx2 -mbmi -mbmi2 -mfma -mf16c

I assume they're all implied by -march=haswell, but -mavx2 alone is much more limiting. Is there a processor that supports AVX2 without all these things? 🤔
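
As a quick way to check which target a given flag set enables, a small probe like this can help (a sketch; `hwy::TargetName` and `HWY_STATIC_TARGET` come from Highway's public headers):

```cpp
#include <cstdio>
#include "hwy/highway.h"
#include "hwy/targets.h"  // hwy::TargetName

int main() {
  // HWY_STATIC_TARGET is the best target enabled by the compile flags,
  // e.g. HWY_AVX2 only if all the required feature flags are present.
  std::printf("static target: %s\n", hwy::TargetName(HWY_STATIC_TARGET));
  return 0;
}
```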

@r-devulap (Member, Author), Feb 7, 2024

Yet the surprising thing was that using -march=haswell makes the performance a lot worse (nearly 4x slower). I need some input from @jan-wassenberg to see what is happening here.

@jan-wassenberg (Contributor), Feb 7, 2024

You are exactly right: we require more flags than just -march=haswell, also -maes; then it is sufficient for HWY_AVX2.
Unfortunately there are a very few Haswell parts without AES, so we do check for that.

Member Author

Out of curiosity: why do you need -maes for HWY_AVX2 if you aren't using any AES related instructions?

Contributor

AESRound is a supported Highway op; we do not know whether users will use it. I suppose the options are to detect at runtime (but that would add considerable overhead), or add another target for AVX2 \ AES, but this is very rare, so not worthwhile, right?

@r-devulap (Member, Author)

This indicates the gather/scatter aren't as optimal as the NumPy ones; I wonder if we can blend the NumPy loads and stores with the Highway code here 🤔

That is extremely likely. My Gather/Scatter were just a quick and dirty way to make this work. I will eventually move to using the Highway implementation.

@seiko2plus (Member) commented Feb 7, 2024

Please explain to me: where did this conclusion come from? The current speed-up is related to special cases (the libm fallback has been improved somehow), which affects both contiguous and non-contiguous. So I think maybe the regression is related to that.

@r-devulap (Member, Author)

@seiko2plus I am seeing a slowdown for strided cases as well. I just meant this could be a result of my GatherIndexN and ScatterIndexN helpers, which just perform a simple scalar loop to load and store (a sketch of such a helper follows the table below). It's only a guess; I will need to take a deeper look into it.

| +        | 11.8±0.1μs                 | 16.3±0.04μs                     |    1.39 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 2, 'f')        |
| +        | 11.9±0.1μs                 | 16.3±0.09μs                     |    1.38 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 2, 'f')        |
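
For context, a hedged sketch of what such a scalar-loop gather helper might look like (the PR's actual helper may differ; `f32`, `vec_f32`, `NPY_FINLINE`, and `HWY_ALIGN` as in the surrounding diff):

```cpp
NPY_FINLINE vec_f32
GatherIndexN(const float* src, npy_intp ssrc, npy_intp len)
{
    // One scalar load per lane into an aligned staging buffer, then a
    // single vector load -- simple, but it serializes the memory accesses.
    HWY_ALIGN float buf[hn::MaxLanes(f32)] = {0};
    const npy_intp n = HWY_MIN(len, (npy_intp)hn::Lanes(f32));
    for (npy_intp i = 0; i < n; ++i) {
        buf[i] = src[i * ssrc];
    }
    return hn::Load(f32, buf);
}
```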

@jan-wassenberg (Contributor)

-march=skylake-avx512. If we use just -mavx512f -mavx512bw, etc., then it's about 4x slower.

Right, Highway checks for multiple CPU flags before using AVX3. -march=skx is sufficient here, but for the individual -mavx512, that would require a long list.

* these numbers
*/
if (!hn::AllFalse(f32, simd_mask)) {
vec_f32 x = hn::IfThenElse(hn::And(nnan_mask, simd_mask), x_in, zerosf);
Contributor
Could also use IfThenElseZero here and above if you like.
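
The suggested form, as a sketch against the quoted line:

```cpp
// IfThenElseZero(m, v) yields v where m is true and zero elsewhere,
// so the explicit zerosf operand is no longer needed.
vec_f32 x = hn::IfThenElseZero(hn::And(nnan_mask, simd_mask), x_in);
```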

cos = hn::IfThenElse(sine_mask, sin, cos);

// multiply by -1 for appropriate elements
opmask_t negate_mask = hn::RebindMask(f32, hn::Eq(hn::And(iquadrant, twos), twos));
Contributor

Is IfNegativeThenNegOrUndefIfZero faster than these two lines?

ScatterIndexN(cos, dst, sdst, len);
}
}
if (simd_maski != (npy_uint64)((1 << lanes) - 1)) {
Contributor
StoreMaskBits is not necessarily cheap. Recommend using CountTrue to get the count instead.
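
A sketch of the suggested check, assuming `simd_mask` and the `f32` tag from the surrounding code:

```cpp
// CountTrue returns the number of set lanes; comparing it against the
// lane count detects "all lanes valid" without serializing mask bits.
if (hn::CountTrue(f32, simd_mask) != hn::Lanes(f32)) {
    // scalar fallback for the lanes that need it
}
```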

Member Author

I am now using if (!hn::AllTrue(f32, simd_mask)) instead and moved the StoreMaskBits inside the if condition (computed only when required). I am afraid CountTrue won't be sufficient here: we also need to know which specific lanes have their bit set.
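
Roughly this shape (a sketch of the described restructuring, not the PR's exact code):

```cpp
if (!hn::AllTrue(f32, simd_mask)) {
    // Only materialize the per-lane bits on the rare fallback path.
    uint8_t bits[8];  // room for up to 64 lanes
    hn::StoreMaskBits(f32, simd_mask, bits);
    // ... inspect individual lanes via bits to run the scalar fallback ...
}
```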

@r-devulap (Member, Author) commented Feb 28, 2024

Moving the hn::StoreMaskBits to inside the if condition helped perf by a little bit; now we are just about 1.2x slower.

| +        | 7.47±0.01μs                | 9.11±0.05μs                     |    1.22 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')        |
| +        | 7.68±0.03μs                | 9.32±0.04μs                     |    1.21 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')        |


for (; len > 0; len -= lanes, src += ssrc*lanes, dst += sdst*lanes) {
vec_f32 x_in;
if (ssrc == 1) {
Contributor
LoadN should only be used in the tail of a loop. It is probably worthwhile replicating the body of the loop (e.g. by moving it to a helper function), with most iterations using a normal load, and only the last iteration using LoadN.
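
A sketch of the suggested structure for the contiguous (ssrc == 1) path; `ProcessAndStore` is a hypothetical helper standing in for the loop body:

```cpp
const size_t lanes = hn::Lanes(f32);
size_t i = 0;
// Main loop: full vectors, plain unaligned loads.
for (; i + lanes <= (size_t)len; i += lanes) {
    ProcessAndStore(hn::LoadU(f32, src + i), dst + i);
}
// Tail: at most one masked LoadN for the remaining < lanes elements.
if (i < (size_t)len) {
    ProcessAndStore(hn::LoadN(f32, src + i, (size_t)len - i), dst + i);
}
```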

} else {
x_in = GatherIndexN(src, ssrc, len);
}
opmask_t nnan_mask = hn::Not(hn::IsNaN(x_in));
Contributor

We can avoid the Not here by swapping the order of args to subsequent IfThenElse that use this, and also replacing And(nnan_mask, ..) with AndNot(nan_mask, ..).
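
Applied to the quoted lines, the transformation would look roughly like this (a sketch using the diff's `x_in`, `zerosf`, and `simd_mask`):

```cpp
// Before: nnan_mask = Not(IsNaN(x_in)); IfThenElse(nnan_mask, x_in, zerosf)
// After: keep the un-negated mask and swap the select arms.
opmask_t nan_mask = hn::IsNaN(x_in);
x_in = hn::IfThenElse(nan_mask, zerosf, x_in);
// ... and later, And(nnan_mask, m) becomes AndNot(nan_mask, m):
vec_f32 x = hn::IfThenElseZero(hn::AndNot(nan_mask, simd_mask), x_in);
```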

cos = hn::IfThenElse(nnan_mask, cos, hn::Set(f32, NPY_NANF));

if (sdst == 1) {
hn::StoreN(cos, f32, dst, len);
Contributor

As with LoadN, StoreN should only be called in the last iteration.

ScatterIndexN(cos, dst, sdst, len);
}
}
if (!hn::AllTrue(f32, simd_mask)) {
Contributor

Consider HWY_UNLIKELY annotation?
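
I.e., something like the following; HWY_UNLIKELY is Highway's branch-prediction hint macro:

```cpp
if (HWY_UNLIKELY(!hn::AllTrue(f32, simd_mask))) {
    // rarely-taken scalar fallback
}
```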

@jan-wassenberg (Contributor)

Moving the hn::StoreMaskBits to inside the if condition helped perf by a little bit; now we are just about 1.2x slower.

| +        | 7.47±0.01μs                | 9.11±0.05μs                     |    1.22 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'f')        |
| +        | 7.68±0.03μs                | 9.32±0.04μs                     |    1.21 | bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'f')        |

Cool, I have added some additional suggestions :)

@Mousius (Member) commented Mar 1, 2024

@jan-wassenberg just looking at the CI failures, it seems there's some compiler incompatibility on PPC (https://github.com/numpy/numpy/actions/runs/8087779139/job/22100478961?pr=25781#step:8:504) and an abort on Z13 (https://github.com/numpy/numpy/actions/runs/8087779139/job/22100479504?pr=25781). Any ideas?

There's also a failure on armhf but I can take a look at that when I get a minute (https://github.com/numpy/numpy/actions/runs/8087779139/job/22100478151?pr=25781)

@jan-wassenberg (Contributor)

hm, the Z13 error is "realloc(): invalid next size" followed by "Fatal Python error: Aborted".
Seems unrelated to SIMD; are we possibly corrupting the heap?

The latter at least I can help with. We are missing HWY_ATTR:

Additionally, each function that calls Highway ops (such as Load) must either be prefixed with HWY_ATTR, OR reside between HWY_BEFORE_NAMESPACE() and HWY_AFTER_NAMESPACE(). Lambda functions currently require HWY_ATTR before their opening brace.
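
Concretely, either of these patterns satisfies that requirement (function and namespace names here are illustrative):

```cpp
// Option 1: per-function target attribute.
HWY_ATTR void SinCosContig(const float* src, float* dst, size_t n) {
    // ... Highway ops ...
}

// Option 2: bracket a whole region so every function inside it is
// compiled with the target's attributes.
HWY_BEFORE_NAMESPACE();
namespace np_hwy {
void SinCosStrided(const float* src, float* dst, size_t n) {
    // ... Highway ops ...
}
}  // namespace np_hwy
HWY_AFTER_NAMESPACE();
```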

@r-devulap (Member, Author)

The latter at least I can help with. We are missing HWY_ATTR:

Adding HWY_ATTR fixed the build errors on ppc64le. Why did it fail only for this platform though?

Need help with debugging 3 more failures:

  1. The crash on s390x platforms. Ping @seiko2plus
  2. armhf failure: looks like accuracy problems. Ping @Mousius
  3. @jan-wassenberg any idea why the cygwin CI fails with this error?
../numpy/_core/src/highway/hwy/ops/generic_ops-inl.h: In function 'void hwy::N_AVX2::StoreInterleaved3(hwy::N_AVX2::VFromD<D>, hwy::N_AVX2::VFromD<D>, hwy::N_AVX2::VFromD<D>, D, hwy::N_AVX2::TFromD<D>*)':
../numpy/_core/src/highway/hwy/ops/generic_ops-inl.h:1320:14: error: expected unqualified-id before numeric constant
 1320 |   const auto B0 = TableLookupBytesOr0(v0, shuf_B0);

@jan-wassenberg (Contributor)

hm, strange. Neither the x86 implementation of TableLookupBytesOr0, nor the quoted line and the one before it, have a numeric constant. Which compiler is cygwin using?

@r-devulap (Member, Author)

hm, strange. Neither the x86 implementation of TableLookupBytesOr0, nor the quoted line and the one before it, have a numeric constant. Which compiler is cygwin using?

From the logs:

C++ compiler for the host machine: c++ (gcc 11.4.0 "c++ (GCC) 11.4.0")
C++ linker for the host machine: c++ ld.bfd 2.42

@jan-wassenberg (Contributor)

hm, here we are able to compile StoreInterleaved3 using GCC 11.4. Are you able to repro the issue in godbolt?

@Mousius added the "component: SIMD" label on Mar 28, 2024
@Mousius (Member) left a review:

LGTM! 🥳

@Mousius merged commit c46a513 into numpy:main on Jun 26, 2024; 68 checks passed
@Mousius (Member) commented Jun 26, 2024

Huge milestone, thanks @r-devulap!

@rgommers (Member)

This was a big lift, thank you all and congrats on getting it over the finish line!

@seiko2plus (Member) commented Oct 22, 2024

@r-devulap, it's great to see progress on moving to Highway! I noticed that you've disabled dispatching for the CPU features of ppc64 and zSystem, which is not going to prevent the baseline capabilities. Have you opened any issue on Highway or NumPy to report this?

@r-devulap (Member, Author)

@seiko2plus good to hear from you again!

Have you opened any issue on Highway or NumPy to report this?

Good point, we haven't. Perhaps Highway is the right place? We aren't sure where the bug is coming from, though.

@jan-wassenberg (Contributor)

hm, what's the issue we were addressing by removing the VSX etc?

@r-devulap (Member, Author)

@seiko2plus #27627

@jan-wassenberg IIRC, it was a seg fault. We should see the failure in #27627

seberg pushed a commit that referenced this pull request Nov 22, 2024
For clarification: SIMD optimizations for the sine and cosine functions on both ppc64 and z/Architecture (IBM Z) were disabled by gh-25781 to get its CI tests passing. This PR aims to re-enable the optimizations for z/Architecture after addressing the following runtime errors, while gh-27627 re-enabled the ppc64 optimizations.

* Re-enable VXE for sin/cos HWY implementation

---------

Co-authored-by: Sayed Adel <[email protected]>
Labels: 01 - Enhancement, component: SIMD
7 participants