
ENH: Add support SLEEF for transcendental functions #23068


Open
yamadafuyuka opened this issue Jan 23, 2023 · 10 comments
Labels
component: SIMD Issues in SIMD (fast instruction sets) code or machinery

Comments

@yamadafuyuka

Functions such as sin and log use libm except on AVX512_SKX, and at least in my environment SIMD instructions were not used.
Therefore, I added an implementation that uses the SIMD library SLEEF ( https://sleef.org/ ) and measured the calculation time of some functions.
My branch: ( https://github.com/yamadafuyuka/numpy/tree/add_SLEEF )

I graphed the results. We also confirmed that using SVE intrinsics as in ( PR-22265 ) gives a further speedup (the log10 function is about 4 times faster).
I would like to add SLEEF support, but I am not sure which part of NumPy is the best place to implement it. Could you please advise?
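For context, a minimal timing sketch (illustrative only, not the actual measurement harness from my branch; assumes NumPy is installed) along the lines of the measurements above could look like this:

```python
# Illustrative micro-benchmark sketch: times one call of a transcendental
# ufunc on a large contiguous array. Not the harness used for the graphs.
import timeit

import numpy as np

def bench(func, n=1_000_000, repeat=5):
    """Best wall-clock time in seconds for a single call of `func`."""
    x = np.linspace(0.1, 10.0, n)
    return min(timeit.repeat(lambda: func(x), number=1, repeat=repeat))

for name in ("sin", "log10", "exp"):
    print(f"{name:6s} {bench(getattr(np, name)) * 1e3:.3f} ms")
```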

@yamadafuyuka yamadafuyuka changed the title ENH: Add support SLEEF ENH: Add support SLEEF for transcendental functions Jan 23, 2023
@mattip
Member

mattip commented Jan 23, 2023

We are trying to move towards using universal intrinsics inside NumPy, so I am not sure we would want to mix in a whole new library with a different paradigm.

What version of NumPy did you test? On what platform?

@kawakami-k
Contributor

kawakami-k commented Jan 23, 2023

Hi, @mattip
I'm implementing SVE support for NumPy with @yamadafuyuka .

The motivation of this issue is to improve the calculation speed of transcendental functions such as sin/cos/tan/log2/log10/exp, etc. For x64 on Linux, NumPy can be built with SVML and the calculation is vectorized. In my understanding, for non-x64 CPUs, the compiler links NumPy against libm for the transcendental functions, which provides non-vectorized implementations. Is this right?

SLEEF is a vectorized mathematical library. It supports multiple architectures, as shown in Table 1.1. Because the function names of SLEEF follow its naming convention, it is easy to abstract function names and write source code for multiple architectures/multiple instruction sets. Below is a function name example; u10 means that the function achieves 1.0-ULP calculation accuracy.

| Transcendental function | Data type | ISA | SLEEF function name |
| --- | --- | --- | --- |
| sine | float | Arm NEON | Sleef_sinf4_u10 |
| sine | float | Arm SVE | Sleef_sinfx_u10sve |
| sine | float | x64 AVX512 | Sleef_sinf16_u10 |
| sine | double | Arm NEON | Sleef_sind2_u10 |
| sine | double | Arm SVE | Sleef_sindx_u10sve |
| sine | double | x64 AVX512 | Sleef_sind8_u10 |
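The regularity of this convention is the point: a symbol name can be composed mechanically. A small sketch (the `sleef_name` helper is hypothetical, not part of SLEEF) of how such names could be assembled:

```python
# Hypothetical helper (not part of SLEEF): composes a SLEEF symbol name from
# the pieces in the table above -- base function, element-type suffix, lane
# count (or 'x' for scalable vectors), accuracy tag, and optional ISA tag.
def sleef_name(func, dtype, lanes, ulp="u10", isa=""):
    suffix = {"float": "f", "double": "d"}[dtype]
    return f"Sleef_{func}{suffix}{lanes}_{ulp}{isa}"

print(sleef_name("sin", "float", 4))                # Arm NEON:  Sleef_sinf4_u10
print(sleef_name("sin", "float", "x", isa="sve"))   # Arm SVE:   Sleef_sinfx_u10sve
print(sleef_name("sin", "double", 8))               # x64 AVX512: Sleef_sind8_u10
```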

NumPy has universal intrinsics for multiple ISAs, so I think there is a way to use them to implement transcendental functions in a unified way. However, it would be time-consuming and difficult to implement the various transcendental functions ourselves. I think it would be a good idea to reuse SLEEF.

Since transcendental function processing is vectorized by SLEEF, the expected performance gain will be close to N, where N is the number of SIMD lanes. In practice, the gain will be smaller than N due to Python and other overhead.
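As a back-of-envelope illustration of that bound N (assuming the 512-bit SVE vectors that A64FX implements):

```python
# Back-of-envelope sketch of the ideal speedup bound N mentioned above:
# how many elements one SIMD register holds for a given vector width.
def simd_lanes(vector_bits, element_bits):
    return vector_bits // element_bits

# A64FX implements 512-bit SVE vectors, so the ideal bound is:
print(simd_lanes(512, 64))  # double: N = 8
print(simd_lanes(512, 32))  # float:  N = 16
```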

Thank you.

@mattip
Member

mattip commented Jan 23, 2023

What version of NumPy did you test to get your performance graphs? On what platform? We have already moved some of these functions to universal intrinsics, which is why I ask for exact platform and version information. It would be great if you could report the output of `import sys, numpy; print(numpy.__version__); print(sys.version)`. If you are running NumPy 1.24+, also show `print(numpy.show_runtime())`.

@yamadafuyuka
Author

yamadafuyuka commented Jan 24, 2023

Thank you for your comment. Sorry for the late reply.
The environment is as follows:

  • NumPy version: 1.23.3
  • Platform: AArch64
  • CPU: A64FX (Armv8.2a + SVE)
```
>>> print(numpy.__version__)
0+untagged.28802.g57e71fd   // Edited "1.23.3-release" version  e47cbb69b
>>> print(sys.version)
3.10.7 (main, Oct  4 2022, 00:38:28) [GCC 11.3.0]
```

@yamadafuyuka
Author

yamadafuyuka commented Jan 24, 2023

I am sorry for the insufficient explanation.

For the functions defined in numpy/numpy/core/src/umath/loops_umath_fp.dispatch.c.src, I want to use SLEEF on other architectures the same way AVX512 uses SVML.
In the current implementation, except for AVX512, NumPy uses the functions from #include <math.h>, which are scalar functions, right?

```c
NPY_NO_EXPORT void NPY_CPU_DISPATCH_CURFX(@TYPE@_@func@)
(char **args, npy_intp const *dimensions, npy_intp const *steps, void *NPY_UNUSED(data))
{
#if NPY_SIMD && defined(NPY_HAVE_AVX512_SKX) && defined(NPY_CAN_LINK_SVML)
    const @type@ *src = (@type@*)args[0];
    @type@ *dst = (@type@*)args[1];
    const int lsize = sizeof(src[0]);
    const npy_intp ssrc = steps[0] / lsize;
    const npy_intp sdst = steps[1] / lsize;
    const npy_intp len = dimensions[0];
    assert(len <= 1 || (steps[0] % lsize == 0 && steps[1] % lsize == 0));
    if (!is_mem_overlap(src, steps[0], dst, steps[1], len) &&
            npyv_loadable_stride_@sfx@(ssrc) &&
            npyv_storable_stride_@sfx@(sdst)) {
        simd_@intrin@_@sfx@(src, ssrc, dst, sdst, len);
        return;
    }
#endif
    UNARY_LOOP {
        const @type@ in1 = *(@type@ *)ip1;
        *(@type@ *)op1 = npy_@intrin@@vsub@(in1);
    }
}
```
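In other words, the control flow is: take the vectorized path only when the SIMD kernel is compiled in and the memory layout permits it, otherwise fall back to the scalar libm loop. A rough Python paraphrase (all names here are illustrative, none of them exist in NumPy):

```python
# Rough, illustrative paraphrase of the C dispatch logic above.
# None of these names are NumPy API; this only mirrors the control flow.
import math

def sin_dispatch(src, simd_kernel=None, layout_ok=True):
    if simd_kernel is not None and layout_ok:
        return simd_kernel(src)            # vectorized path (SVML/SLEEF-style)
    return [math.sin(x) for x in src]      # scalar libm fallback

print(sin_dispatch([0.0, math.pi / 2]))
```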

@mattip
Member

mattip commented Jan 24, 2023

We discussed approaches to using SIMD intrinsics in NEP 38. Specifically, we have a section about code enhancements. We did not really apply that section in the discussion to add SVML (PR #19478) other than to note:

> Getting SVML with BSD license is great deal, and it gonna be good base for start replacing them to universal intrinsics. Thank you!

There was a brief mention of SLEEF in that PR, but we did not consider using SLEEF instead of, or in addition to, SVML.

Looking back over the mailing list, there is the discussion in 2015 mentioned in the SVML PR, and a recent mail from Chris Sidebottom about an effort to target aarch64.

I am not sure how I feel about integrating yet another vendored library for accelerated operations. On the one hand, we already have precedent with SVML. Integrating SLEEF would improve performance for other platforms. On the other, SLEEF's sources are twice as large as SVML, and the scope is larger. Would we then declare that we are not going to move these functions to universal intrinsics? What would we do with the code from #17587, #18101, and more? Could we do something more generic so that people who wished to could switch out SVML entirely, or use VOLK (GPL3) or simd or another library?

Maybe I am overthinking this, and we should just move forward since there is a contributor willing to do the work. I do think this should hit the mailing list.

@kawakami-k
Contributor

@mattip
Thank you for letting me know about the previous discussions. I would consider discussing this on the mailing list.

@yamadafuyuka
Author

@mattip
Thank you very much for your comment.
I would consider it with @kawakami-k .

@seiko2plus seiko2plus self-assigned this Jan 25, 2023
@seiko2plus seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Jan 25, 2023
@seiko2plus seiko2plus removed their assignment Feb 8, 2023
@Mousius
Member

Mousius commented May 18, 2023

@mattip is it worth revisiting this, as the universal intrinsics work is likely to be fairly long-lived (#23603 has been open for a month now with no activity)? SLEEF could provide a short-term boost, though from my initial look I don't think it handles errors correctly.

@mattip
Member

mattip commented May 19, 2023

I don't think SLEEF is a step in the right direction, I think we should close this PR.
