ENH: Add SIMD sin/cos implementation with numpy-simd-routines #29699

seiko2plus · 2025-09-06T15:16:21Z

numpy-simd-routines is added as a subrepo in the meson subprojects
directory. The current FP configuration is static: ~1 ULP for
double-precision and ~4 ULP for single-precision, with handling of
floating-point errors and special cases, extended precision for large
arguments, and subnormals all enabled by default.
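To make the ULP figures above concrete, here is a minimal sketch of how an error in ULPs can be measured from Python for single-precision results, using a float64 evaluation as the reference. The helper name `ulp_error` is hypothetical and not part of NumPy or numpy-simd-routines, and the approach assumes float64 is accurate enough to serve as a reference for float32 (checking double-precision results this way would need a higher-precision reference).

```python
import numpy as np

def ulp_error(got, ref):
    """Error of `got` against a higher-precision `ref`, in ULPs of got.dtype.

    Hypothetical helper for illustration only.
    """
    got = np.asarray(got)
    ref = np.asarray(ref, dtype=np.float64)
    # Size of one ULP at the magnitude of the reference, in got's precision.
    # np.spacing returns a signed value, so take the magnitude first.
    one_ulp = np.spacing(np.abs(ref).astype(got.dtype))
    return np.abs(got.astype(np.float64) - ref) / one_ulp

# Example: worst-case float32 sin error on a small input range.
x32 = np.linspace(0.5, 100.0, 10_000, dtype=np.float32)
worst = ulp_error(np.sin(x32), np.sin(x32.astype(np.float64))).max()
print(f"max float32 sin error: {worst:.2f} ULP")
```

The printed value depends on the NumPy build, compiler, and CPU features in use, so no expected output is given here.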

numpy-simd-routines supports all SIMD extensions that are supported
by Google Highway, including non-FMA extensions, and is fully
independent of libm to guarantee consistent results across all
compilers and platforms.

Full benchmarks will be provided within the pull request. The following
benchmark was run with clang-19 on an x86 CPU (Ryzen 7 7700X)
with AVX512 enabled.

Note: previously there was no SIMD optimization for double-precision
sin/cos, only for single-precision.

Before	After	Ratio	Benchmark (Parameter)
713±6μs	633±6μs	0.89	UnaryFP(<ufunc 'cos'>, 1, 2, 'f')
717±9μs	637±6μs	0.89	UnaryFP(<ufunc 'cos'>, 4, 1, 'f')
705±3μs	607±10μs	0.86	UnaryFP(<ufunc 'sin'>, 4, 1, 'f')
714±10μs	595±0.5μs	0.83	UnaryFP(<ufunc 'sin'>, 1, 2, 'f')
370±0.3μs	277±4μs	0.75	UnaryFP(<ufunc 'cos'>, 1, 1, 'f')
373±2μs	236±0.6μs	0.63	UnaryFP(<ufunc 'sin'>, 1, 1, 'f')
1.06±0.01ms	648±3μs	0.61	UnaryFP(<ufunc 'cos'>, 4, 2, 'f')
1.06±0.01ms	617±30μs	0.58	UnaryFP(<ufunc 'sin'>, 4, 2, 'f')
5.06±0.06ms	2.61±0.3ms	0.52	UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'd')
1.48±0ms	715±5μs	0.48	UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'f')
1.50±0.01ms	639±6μs	0.43	UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'f')
5.15±0.1ms	1.96±0.01ms	0.38	UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'd')
5.72±0.02ms	2.09±0.1ms	0.37	UnaryFP(<ufunc 'cos'>, 4, 2, 'd')
5.76±0.01ms	2.03±0.08ms	0.35	UnaryFP(<ufunc 'sin'>, 4, 2, 'd')
5.07±0.08ms	1.76±0.2ms	0.35	UnaryFPSpecial(<ufunc 'cos'>, 1, 2, 'd')
6.04±0.04ms	2.05±0.09ms	0.34	UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'd')
5.79±0.03ms	1.90±0.2ms	0.33	UnaryFP(<ufunc 'sin'>, 4, 1, 'd')
2.29±0.1ms	762±40μs	0.33	UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'f')
5.72±0.1ms	1.75±0.07ms	0.31	UnaryFP(<ufunc 'cos'>, 4, 1, 'd')
6.04±0.03ms	1.82±0.2ms	0.3	UnaryFPSpecial(<ufunc 'sin'>, 4, 1, 'd')
2.49±0.1ms	748±30μs	0.3	UnaryFPSpecial(<ufunc 'sin'>, 4, 2, 'f')
2.23±0.1ms	634±6μs	0.28	UnaryFPSpecial(<ufunc 'cos'>, 4, 1, 'f')
1.31±0.03ms	367±5μs	0.28	UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'f')
2.55±0.09ms	654±30μs	0.26	UnaryFPSpecial(<ufunc 'cos'>, 4, 2, 'f')
4.97±0.03ms	1.14±0ms	0.23	UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'd')
5.67±0.01ms	1.22±0.03ms	0.22	UnaryFP(<ufunc 'cos'>, 1, 2, 'd')
5.76±0.03ms	1.28±0.06ms	0.22	UnaryFP(<ufunc 'sin'>, 1, 2, 'd')
1.26±0.01ms	272±2μs	0.22	UnaryFPSpecial(<ufunc 'cos'>, 1, 1, 'f')
7.03±0.02ms	1.31±0.01ms	0.19	UnaryFPSpecial(<ufunc 'sin'>, 1, 2, 'd')
5.67±0.01ms	810±9μs	0.14	UnaryFP(<ufunc 'cos'>, 1, 1, 'd')
5.71±0.01ms	817±40μs	0.14	UnaryFP(<ufunc 'sin'>, 1, 1, 'd')
7.05±0.03ms	915±4μs	0.13	UnaryFPSpecial(<ufunc 'sin'>, 1, 1, 'd')
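For reading the table, the Ratio column is consistent with After divided by Before, so smaller is faster. The sketch below recomputes it from the timing cells; the helper names `parse_seconds` and `ratio` are hypothetical and exist only for this illustration.

```python
import re

# Seconds per unit; "\u03bc" is the Greek mu used in the table.
_UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "\u03bcs": 1e-6, "ns": 1e-9}
# A cell looks like "5.67±0.01ms": mean, spread, unit.
_CELL = re.compile(r"([\d.]+)±([\d.]+)(ns|[\u03bc\u00b5]s|ms|s)")

def parse_seconds(cell):
    """Return the mean time of a 'mean±spread unit' cell, in seconds."""
    match = _CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized timing cell: {cell!r}")
    mean, _spread, unit = match.groups()
    # Accept the micro sign (U+00B5) as well as Greek mu (U+03BC).
    unit = unit.replace("\u00b5", "\u03bc")
    return float(mean) * _UNIT_SECONDS[unit]

def ratio(before, after):
    """After/Before ratio; values below 1.0 mean the change is faster."""
    return parse_seconds(after) / parse_seconds(before)

print(round(ratio("5.67±0.01ms", "810±9μs"), 2))  # → 0.14
```

The speedup is the reciprocal of the ratio, so the 0.14 row for double-precision `cos` corresponds to roughly a 7x improvement.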

github-actions bot added the 01 - Enhancement label Sep 6, 2025

seiko2plus force-pushed the brings_npsr branch 2 times, most recently from 0ccab74 to 77f1bc9 September 6, 2025 15:27

seiko2plus force-pushed the brings_npsr branch from 77f1bc9 to a3f746e September 6, 2025 15:31

seiko2plus added 2 commits September 8, 2025 01:31

fix up

0cbb055

Relax sin/cos ULP test for float32 on non-FMA

af5d98a

Allow up to 3 ULP error for float32 sin/cos when native FMA is not available.
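A minimal sketch of what such a relaxed check could look like, using `numpy.testing.assert_array_max_ulp` against a float64 reference. This is not the test added by the commit: the function name `check_sincos_float32`, the `has_native_fma` flag, the input range, and the 2 ULP bound used when FMA is available are all assumptions made for illustration. Only the 3 ULP non-FMA bound comes from the commit message.

```python
import numpy as np
from numpy.testing import assert_array_max_ulp

def check_sincos_float32(has_native_fma, n=10_000, seed=0):
    """Check float32 sin/cos against float64, relaxing the bound without FMA.

    `has_native_fma` is a hypothetical flag supplied by the caller; how the
    real test detects FMA support is not shown in this document.
    """
    rng = np.random.default_rng(seed)
    x32 = rng.uniform(-100.0, 100.0, n).astype(np.float32)
    x64 = x32.astype(np.float64)
    # Assumed bound of 2 ULP with FMA; 3 ULP without, per the commit message.
    maxulp = 2 if has_native_fma else 3
    for func in (np.sin, np.cos):
        # Round the float64 reference to float32 before comparing.
        assert_array_max_ulp(func(x32), func(x64).astype(np.float32),
                             maxulp=maxulp)
    return maxulp
```

Calling `check_sincos_float32(has_native_fma=False)` raises an `AssertionError` if any result differs from the rounded reference by more than 3 ULP, and returns the bound it applied otherwise.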

seiko2plus force-pushed the brings_npsr branch from 09414c8 to af5d98a September 7, 2025 22:33

seiko2plus added 2 commits September 8, 2025 12:02

fix up up

e782954

fix up up up up

2708fc0

rgommers added the "component: SIMD" label Sep 10, 2025

seiko2plus added this to the 2.4.0 release milestone Sep 13, 2025
