Missed loop vectorization in add_fixed example. #13

@gendx

Description

The add_fixed example adds two slices element-wise using Paralight in RangeStrategy::Fixed mode. This means that each thread should simply process a triple of fixed sub-slices serially, and the per-thread compiled code should be similar to the add_serial example.
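As a rough illustration of that expectation, here is a minimal standard-library sketch of "each thread serially processes its own triple of fixed sub-slices". It does not use Paralight and is not how Paralight is implemented; `add_fixed_chunks`, the chunking scheme and the `u64` element type are assumptions made for the example.

```rust
use std::thread;

/// Serial element-wise addition over three equally sized slices.
/// `wrapping_add` is used so the behavior is the same in debug and release builds.
fn add_serial(left: &[u64], right: &[u64], out: &mut [u64]) {
    for ((o, l), r) in out.iter_mut().zip(left).zip(right) {
        *o = l.wrapping_add(*r);
    }
}

/// Hypothetical helper (not a Paralight API): split the inputs into fixed
/// chunks and let each scoped thread run the serial loop on its own chunk.
fn add_fixed_chunks(left: &[u64], right: &[u64], out: &mut [u64], num_threads: usize) {
    assert!(num_threads > 0);
    assert_eq!(left.len(), right.len());
    assert_eq!(left.len(), out.len());
    // Ceiling division; `max(1)` keeps `chunks` from panicking on empty input.
    let chunk_len = left.len().div_ceil(num_threads).max(1);
    thread::scope(|s| {
        for ((o, l), r) in out
            .chunks_mut(chunk_len)
            .zip(left.chunks(chunk_len))
            .zip(right.chunks(chunk_len))
        {
            // Each thread owns one (left, right, out) triple of sub-slices.
            s.spawn(move || add_serial(l, r, o));
        }
    });
}

fn main() {
    let left: Vec<u64> = (0..1000).collect();
    let right: Vec<u64> = (0..1000).map(|x| 2 * x).collect();
    let mut out = vec![0u64; 1000];
    add_fixed_chunks(&left, &right, &mut out, 4);
    assert!(out.iter().enumerate().all(|(i, &x)| x == 3 * i as u64));
}
```

In this sketch the per-thread work is literally the serial function, so whatever the compiler does for the serial loop applies to each chunk as well.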

However, in practice the fully serial example is compiled into a vectorized loop using SIMD instructions, while the Paralight example is compiled into a scalar loop that processes one element per iteration. This leaves a lot of performance on the table: depending on the number of threads and the number of SIMD lanes, the speed-up from multi-threading with Paralight is offset by the lack of vectorization.

Ideally, the Paralight code should benefit from both multi-threading and vectorization, especially with RangeStrategy::Fixed.

Note: This observation applies to this specific simple example, so the risk of Paralight code being under-optimized should be lower for more complex loops.

This is similar to #12, but I'm tracking it separately because a fix might be more complex.

Methodology

At commit 4613509.

cargo --version -v:

cargo 1.90.0 (840b83a10 2025-07-30)
release: 1.90.0
commit-hash: 840b83a10fb0e039a83f4d70ad032892c287570a
commit-date: 2025-07-30
host: x86_64-unknown-linux-gnu

Build step:

RUSTFLAGS='-C force-frame-pointers=y' cargo build --release --examples

Profiling step:

perf record -g ./target/release/examples/add_serial

Baseline add_serial

Disassembly of the hot loop:

$ perf annotate --stdio2 > add_serial.log

              Disassembly of section .text:
                 
              0000000000015210 <add_serial::main>:
                ...
         330:   movdqu (%rax,%r8,8),%xmm0
  14.49         movdqu 0x10(%rax,%r8,8),%xmm1
                movdqu (%rcx,%r8,8),%xmm2
   2.56         paddq  %xmm0,%xmm2
   9.79         movdqu 0x10(%rcx,%r8,8),%xmm0
   1.72         paddq  %xmm1,%xmm0
                movdqu %xmm2,(%rdx,%r8,8)
  20.90         movdqu %xmm0,0x10(%rdx,%r8,8)
                add    $0x4,%r8 
                cmp    %r8,%rdi 
   4.11       ↑ jne    330 

Paralight add_fixed (RangeStrategy::Fixed)

Disassembly of the hot loop:

$ perf annotate --stdio2 > add_fixed.log

              Disassembly of section .text:
                 
              0000000000020000 <<paralight::core::thread_pool::IterPipelineImpl<Output,Accum,Cleanup> as paralight::core::thread_pool::Pipeline<R>>::run>:
                ...
          30:   cmp    0x10(%rcx),%rax
              ↓ jae    f0       
   0.11         mov    0x8(%rcx),%rsi
                cmp    %rsi,%rax
              ↓ jae    179      
  14.17         mov    0x28(%rcx),%rsi
                cmp    %rsi,%rax
              ↓ jae    179      
                mov    0x18(%rcx),%rsi
   0.10         test   %rsi,%rsi
              ↓ je     7c       
  15.14         mov    (%rcx),%rdi
   0.32         mov    0x20(%rcx),%r8
  11.65         mov    (%r8,%rax,8),%r8
  10.14         add    (%rdi,%rax,8),%r8
  33.32         mov    %r8,(%rsi,%rax,8)
                lea    0x1(%rax),%rsi
   0.32         mov    %rsi,%rax
                cmp    %rsi,%rdx
  14.73       ↑ jne    30       
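The loop above reloads several values from memory and compares the index against them on every iteration. One plausible reading, which is an assumption and not something verified against Paralight's source, is that these are per-element bounds checks on the three slices. The standalone sketch below (hypothetical function names, no Paralight code) shows the general pattern: an indexed loop that checks each access, next to a variant that re-slices all three inputs to a common length once before the loop. Whether either version gets vectorized depends on the compiler version and flags, so the generated assembly should be inspected rather than assumed.

```rust
/// Index-based loop: each `[]` access carries its own bounds check, because
/// nothing here tells the compiler that all three slices hold at least `n` items.
fn add_indexed(left: &[u64], right: &[u64], out: &mut [u64], n: usize) {
    for i in 0..n {
        out[i] = left[i].wrapping_add(right[i]);
    }
}

/// Same computation, but the three slices are re-sliced to length `n` once,
/// before the loop. The length checks happen up front (panicking if a slice is
/// too short), which gives the optimizer a chance to drop the per-element checks.
fn add_hoisted(left: &[u64], right: &[u64], out: &mut [u64], n: usize) {
    let (left, right, out) = (&left[..n], &right[..n], &mut out[..n]);
    for i in 0..n {
        out[i] = left[i].wrapping_add(right[i]);
    }
}

fn main() {
    let left: Vec<u64> = (0..64).collect();
    let right: Vec<u64> = (0..64).map(|x| x * x).collect();
    let mut a = vec![0u64; 64];
    let mut b = vec![0u64; 64];
    add_indexed(&left, &right, &mut a, 64);
    add_hoisted(&left, &right, &mut b, 64);
    // Both variants compute the same result; only the generated code may differ.
    assert_eq!(a, b);
}
```

Comparing the disassembly of these two functions in a release build is one way to check whether bounds checks alone explain a missed vectorization.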
