ENH, SIMD: Add CPU feature detection and simd functions for AArch64 SVE #22265
Conversation
(Force-pushed from 8c8425a to c9dd736)
Thanks a lot for this effort! Do you by any chance have any performance benchmarks? (See the benchmark docs.)
Compile-time sizeless SIMD extensions should be treated as they were designed: providing compiled objects for each possible width (256, 512, 1024, 2048) would increase binary size and maintenance effort. Note that the current SVE implementation only supports a 512-bit width. IMHO it's better to wait until we're done with the C++ interface of the universal intrinsics, which is designed to support sizeless SIMD extensions (#21057), since we are moving to C++ anyway. However, we could still modify the C interface to make it friendly to sizeless SIMD extensions. Thoughts? @charris, @rgommers, @mattip
Hi @seiko2plus, thank you for the comment. When I tried to implement this in a sizeless manner, I couldn't implement the following part (the `typedef union simd_data`). If that part can be solved, I think all the other parts (core/src/common/simd/sve/(conversion|memory).h) can be done in a sizeless manner.
Hi @EwoutH, in my environment (512-bit SVE) I've confirmed a performance gain of more than three times, depending on the benchmark. I've also observed a performance drop of a few percent on some benchmarks. I need in-company approval to disclose absolute benchmark times. Could you give me a week?
Hi @kawakami-k, I would suggest postponing your current work for 1-2 months until we are done with #2105,
In case you don't get permission to post absolute numbers, you could perhaps take the output of |
Hi, @rgommers
Thank you for the comment. 1) One idea is to disclose relative performance. 2) Another idea: I'm preparing to run benchmarks on AWS Graviton3 (256-bit SVE). For now I'm going with 2). Since I won't have much time in September, I will measure and compare benchmarks in early October. Thank you.
Hi, @seiko2plus
I haven't had time to study the new C++ interface proposal yet, but it's no problem to modify this PR to fit the new interface. Thank you.
This sounds really awesome! Are you able to share any numbers? If absolute figures (seconds) aren't possible, you could also share speedups relative to NEON or to plain C (e.g. 2.39x).
(Force-pushed from 6cf0329 to a200644)
Below are the benchmark results on AWS Graviton3 (256-bit SVE). The source code I used is 86cd584b and ffe9cf2c. The implementation has been changed to be as SIMD-size-independent as possible.
(Force-pushed from 6694061 to 99ca3cd)
@seiko2plus, we're now past the timeframe suggested here; is there any way to unblock this PR? It'd be great to be able to leverage SVE as we migrate the other routines to universal intrinsics.
This PR enhances the CPU feature detection function to detect the Arm SVE architecture. It also includes vectorized functions for SVE, implemented similarly to AVX/AVX2/ASIMD. The regression test (runtests.py) was executed on a Fujitsu FX700 with A64FX, an Armv8.2-A + SVE compliant CPU. The result was "21354 passed, 203 skipped, 1302 deselected, 30 xfailed, 7 xpassed".
Because SVE2 (Armv9) is a superset of SVE, I believe this PR also improves NumPy performance on SVE2 environments.