Description
The add_fixed example adds two slices element-wise using Paralight in RangeStrategy::Fixed mode. This means that each thread should simply process a triple of fixed sub-slices serially, and the per-thread compiled code should be similar to the add_serial example.
However, in practice the fully serial example is compiled into a vectorized loop using SIMD instructions, while the Paralight example is compiled into a more naive element-by-element loop. This leaves a lot of performance on the table: the speed-up from multi-threading with Paralight is offset by the lack of vectorization, to a degree that depends on the number of threads and the number of SIMD lanes.
Ideally, the Paralight code should benefit from both multi-threading and vectorization, especially with RangeStrategy::Fixed.
Note: This observation applies to this specific simple example, so the risk of Paralight code being under-optimized should be lower for more complex loops.
This is similar to #12, but I'm tracking it separately because a fix might be more complex.
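To make the expectation above concrete, here is a minimal sketch of what "each thread processes a triple of fixed sub-slices serially" could look like using only the standard library. It is not Paralight's implementation: `add_serial_chunk` and `add_fixed_chunks` are hypothetical names, the `u64` element type is inferred from the `paddq` instructions in the disassembly, and `wrapping_add` is an assumption made to keep the sketch panic-free.

```rust
use std::thread;

/// Serial element-wise addition over three equal-length slices.
/// Zipping the iterators avoids per-element index checks in the loop body.
fn add_serial_chunk(left: &[u64], right: &[u64], output: &mut [u64]) {
    assert_eq!(left.len(), output.len());
    assert_eq!(right.len(), output.len());
    for ((out, l), r) in output.iter_mut().zip(left).zip(right) {
        // Assumption: wrapping arithmetic, so overflow never panics.
        *out = l.wrapping_add(*r);
    }
}

/// Hypothetical stand-in for a fixed range strategy (NOT Paralight's code):
/// split the three slices into contiguous chunks, one per thread, and run
/// the serial loop on each triple of sub-slices.
fn add_fixed_chunks(left: &[u64], right: &[u64], output: &mut [u64], num_threads: usize) {
    assert!(num_threads > 0);
    assert_eq!(left.len(), output.len());
    assert_eq!(right.len(), output.len());
    let len = output.len();
    if len == 0 {
        return; // `chunks(0)` would panic, so handle empty input up front.
    }
    let chunk = len.div_ceil(num_threads);
    thread::scope(|s| {
        let triples = output
            .chunks_mut(chunk)
            .zip(left.chunks(chunk))
            .zip(right.chunks(chunk));
        for ((out, l), r) in triples {
            // Each thread sees plain sub-slices, so its loop is the same
            // code as the serial version.
            s.spawn(move || add_serial_chunk(l, r, out));
        }
    });
}
```

In this shape the per-thread work is the serial function itself, so the compiler has the same opportunity to vectorize it as in add_serial. Whether it actually does so still has to be checked in the generated assembly.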
Methodology
At commit 4613509.
cargo --version -v:
cargo 1.90.0 (840b83a10 2025-07-30)
release: 1.90.0
commit-hash: 840b83a10fb0e039a83f4d70ad032892c287570a
commit-date: 2025-07-30
host: x86_64-unknown-linux-gnu
Build step:
RUSTFLAGS='-C force-frame-pointers=y' cargo build --release --examples
Profiling step:
perf record -g ./target/release/examples/add_serial
Baseline add_serial
Disassembly of the hot loop:
$ perf annotate --stdio2 > add_serial.log
Disassembly of section .text:
0000000000015210 <add_serial::main>:
...
330: movdqu (%rax,%r8,8),%xmm0
14.49 movdqu 0x10(%rax,%r8,8),%xmm1
movdqu (%rcx,%r8,8),%xmm2
2.56 paddq %xmm0,%xmm2
9.79 movdqu 0x10(%rcx,%r8,8),%xmm0
1.72 paddq %xmm1,%xmm0
movdqu %xmm2,(%rdx,%r8,8)
20.90 movdqu %xmm0,0x10(%rdx,%r8,8)
add $0x4,%r8
cmp %r8,%rdi
4.11 ↑ jne 330
Paralight add_fixed (RangeStrategy::Fixed)
Disassembly of the hot loop:
$ perf annotate --stdio2 > add_fixed.log
Disassembly of section .text:
0000000000020000 <<paralight::core::thread_pool::IterPipelineImpl<Output,Accum,Cleanup> as paralight::core::thread_pool::Pipeline<R>>::run>:
...
30: cmp 0x10(%rcx),%rax
↓ jae f0
0.11 mov 0x8(%rcx),%rsi
cmp %rsi,%rax
↓ jae 179
14.17 mov 0x28(%rcx),%rsi
cmp %rsi,%rax
↓ jae 179
mov 0x18(%rcx),%rsi
0.10 test %rsi,%rsi
↓ je 7c
15.14 mov (%rcx),%rdi
0.32 mov 0x20(%rcx),%r8
11.65 mov (%r8,%rax,8),%r8
10.14 add (%rdi,%rax,8),%r8
33.32 mov %r8,(%rsi,%rax,8)
lea 0x1(%rax),%rsi
0.32 mov %rsi,%rax
cmp %rsi,%rdx
14.73 ↑ jne 30
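One possible reading of the add_fixed hot loop is that, on every iteration, it reloads lengths and pointers from memory (the `0x8(%rcx)`, `0x10(%rcx)`, `0x28(%rcx)` operands) and compares the index against them before touching a single element. The sketch below contrasts that access pattern with one that narrows the slices once up front. Both functions are hypothetical illustrations, not code from Paralight or its examples, and the compiler is free to vectorize either, both, or neither, so the output should be inspected with `perf annotate` as above.

```rust
/// Indexes three slices with a shared index. Each access carries its own
/// bounds check unless the compiler manages to hoist or eliminate it.
fn add_indexed(left: &[u64], right: &[u64], output: &mut [u64], start: usize, end: usize) {
    for i in start..end {
        // Assumption: wrapping arithmetic, so overflow never panics.
        output[i] = left[i].wrapping_add(right[i]);
    }
}

/// Narrows each slice to the range once (three bounds checks in total),
/// then zips the sub-slices so the loop body has no index checks left.
fn add_narrowed(left: &[u64], right: &[u64], output: &mut [u64], start: usize, end: usize) {
    let l = &left[start..end];
    let r = &right[start..end];
    let out = &mut output[start..end];
    for ((o, a), b) in out.iter_mut().zip(l).zip(r) {
        *o = a.wrapping_add(*b);
    }
}
```

Both functions compute the same result for any valid range. The only difference is where the bounds checks happen, which is the kind of difference that can decide whether the loop gets vectorized.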