Missed loop vectorization in `add_fixed` example.

The [`add_fixed` example](https://github.com/gendx/paralight/blob/4613509cdc52847e11e292863486d7bcd97437a6/examples/add_fixed.rs) adds two slices element-wise using Paralight in `RangeStrategy::Fixed` mode. This means that each thread should simply process a triple of fixed sub-slices serially, and the per-thread compiled code should be similar to the [`add_serial` example](https://github.com/gendx/paralight/blob/4613509cdc52847e11e292863486d7bcd97437a6/examples/add_serial.rs).
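To make that expectation concrete, here is a minimal sketch of what a fixed range strategy amounts to, written with `std::thread::scope` rather than Paralight's API. The names `add_chunk` and `add_fixed_sketch` are hypothetical, the `u64` element type is assumed from the 64-bit operations in the disassembly below, and the chunking scheme is an illustration, not necessarily how Paralight splits its ranges.

```rust
use std::thread;

/// Serial kernel: the per-thread work, with the same shape as a fully serial add.
/// `wrapping_add` mirrors the wrapping behaviour of `+` in release builds.
fn add_chunk(left: &[u64], right: &[u64], out: &mut [u64]) {
    for ((o, l), r) in out.iter_mut().zip(left).zip(right) {
        *o = l.wrapping_add(*r);
    }
}

/// Hypothetical stand-in for a fixed range strategy (not Paralight's API):
/// split the three slices into matching contiguous chunks, one per thread.
fn add_fixed_sketch(left: &[u64], right: &[u64], out: &mut [u64], num_threads: usize) {
    assert_eq!(left.len(), right.len());
    assert_eq!(left.len(), out.len());
    assert!(num_threads > 0);
    // Ceiling division yields at most `num_threads` chunks;
    // `max(1)` keeps the chunk length non-zero for empty inputs.
    let chunk_len = left.len().div_ceil(num_threads).max(1);
    thread::scope(|s| {
        for ((l, r), o) in left
            .chunks(chunk_len)
            .zip(right.chunks(chunk_len))
            .zip(out.chunks_mut(chunk_len))
        {
            // Each thread runs the serial kernel on its own triple of sub-slices.
            s.spawn(move || add_chunk(l, r, o));
        }
    });
}
```

In this sketch the per-thread code is literally the serial kernel, which is why one would hope for the same vectorized loop in each thread.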

In practice, however, the fully serial example compiles to a vectorized loop using SIMD instructions, while the Paralight example compiles to a naive element-by-element loop. This leaves significant performance on the table: the speedup from multi-threading with Paralight is offset by the lack of vectorization, to a degree that depends on the number of threads and the number of SIMD lanes.
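As a rough way to reason about that trade-off, the sketch below computes an idealized throughput ratio. The function name is hypothetical, and the model is an assumption: it ignores memory bandwidth, loop unrolling, and threading overhead, so it only gives an order of magnitude.

```rust
/// Idealized throughput of `threads` scalar workers relative to a single
/// vectorized worker that handles `lanes` elements per SIMD instruction.
/// Assumption: perfect scaling in both dimensions, no memory bottleneck.
fn scalar_parallel_vs_simd_serial(threads: u32, lanes: u32) -> f64 {
    f64::from(threads) / f64::from(lanes)
}
```

For instance, with 128-bit registers and 64-bit elements (2 lanes), `scalar_parallel_vs_simd_serial(4, 2)` returns `2.0`, whereas a fully vectorized 4-thread version would ideally reach 4 times the vectorized serial throughput.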

Ideally, the Paralight code should benefit from both multi-threading and vectorization, especially with `RangeStrategy::Fixed`.

**Note**: This observation applies to this specific simple example. More complex loops are less likely to be auto-vectorized even when written serially, so the risk of Paralight code being under-optimized relative to serial code should be lower for them.

This is similar to #12, but I am tracking it separately because a fix might be more complex.

# Methodology

All measurements were taken at commit 4613509cdc52847e11e292863486d7bcd97437a6.

`cargo --version -v`:

```
cargo 1.90.0 (840b83a10 2025-07-30)
release: 1.90.0
commit-hash: 840b83a10fb0e039a83f4d70ad032892c287570a
commit-date: 2025-07-30
host: x86_64-unknown-linux-gnu
```

Build step:

```bash
RUSTFLAGS='-C force-frame-pointers=y' cargo build --release --examples
```

Profiling step:

```bash
perf record -g ./target/release/examples/add_serial
```

## Baseline `add_serial`

Disassembly of the hot loop:

```bash
$ perf annotate --stdio2 > add_serial.log

              Disassembly of section .text:
                 
              0000000000015210 <add_serial::main>:
                ...
         330:   movdqu (%rax,%r8,8),%xmm0
  14.49         movdqu 0x10(%rax,%r8,8),%xmm1
                movdqu (%rcx,%r8,8),%xmm2
   2.56         paddq  %xmm0,%xmm2
   9.79         movdqu 0x10(%rcx,%r8,8),%xmm0
   1.72         paddq  %xmm1,%xmm0
                movdqu %xmm2,(%rdx,%r8,8)
  20.90         movdqu %xmm0,0x10(%rdx,%r8,8)
                add    $0x4,%r8 
                cmp    %r8,%rdi 
   4.11       ↑ jne    330 
```

## Paralight `add_fixed` (`RangeStrategy::Fixed`)

Disassembly of the hot loop:

```bash
$ perf annotate --stdio2 > add_fixed.log

              Disassembly of section .text:
                 
              0000000000020000 <<paralight::core::thread_pool::IterPipelineImpl<Output,Accum,Cleanup> as paralight::core::thread_pool::Pipeline<R>>::run>:
                ...
          30:   cmp    0x10(%rcx),%rax
              ↓ jae    f0       
   0.11         mov    0x8(%rcx),%rsi
                cmp    %rsi,%rax
              ↓ jae    179      
  14.17         mov    0x28(%rcx),%rsi
                cmp    %rsi,%rax
              ↓ jae    179      
                mov    0x18(%rcx),%rsi
   0.10         test   %rsi,%rsi
              ↓ je     7c       
  15.14         mov    (%rcx),%rdi
   0.32         mov    0x20(%rcx),%r8
  11.65         mov    (%r8,%rax,8),%r8
  10.14         add    (%rdi,%rax,8),%r8
  33.32         mov    %r8,(%rsi,%rax,8)
                lea    0x1(%rax),%rsi
   0.32         mov    %rsi,%rax
                cmp    %rsi,%rdx
  14.73       ↑ jne    30       
```
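One possible reading of the two listings, offered as a hypothesis rather than a confirmed diagnosis: the `add_fixed` loop compares the index against several lengths and reloads slice pointers through `%rcx` on every iteration, which is the shape one would expect when each element is accessed by index through state held behind a pointer. The sketch below contrasts that access pattern with a zipped-slice loop in plain Rust. Both function names are hypothetical, neither is Paralight code, and whether either loop is actually vectorized depends on the compiler version and flags.

```rust
use std::ops::Range;

/// Per-index access: each `[i]` carries its own bounds check unless the
/// optimizer can prove it redundant, which may get in the way of vectorization.
fn add_indexed(left: &[u64], right: &[u64], out: &mut [u64], range: Range<usize>) {
    for i in range {
        out[i] = left[i].wrapping_add(right[i]);
    }
}

/// Zipped iteration: the trip count is the shortest of the three lengths and is
/// known before the loop starts, so no per-element bounds check is needed.
fn add_zipped(left: &[u64], right: &[u64], out: &mut [u64]) {
    for ((o, l), r) in out.iter_mut().zip(left).zip(right) {
        *o = l.wrapping_add(*r);
    }
}
```

Both functions compute the same result when `range` covers the full slices, so comparing their disassembly (for example with `perf annotate` as above) would be one way to test this hypothesis in isolation.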
