yazon · yazon · Aug 14, 2025
diff --git a/neon_oe_analysis.md b/neon_oe_analysis.md
@@ -0,0 +1,389 @@
+# Detailed Analysis of ARM32 NEON `neon_oe` Function
+
+## Overview
+
+The `neon_oe` function in `src/neon.s` is a highly optimized ARM32 NEON assembly implementation for performing FFT (Fast Fourier Transform) butterfly operations. The function name likely stands for "odd-even" processing, which is a common pattern in FFT algorithms.
+
+## Function Entry Assumptions
+
+### Register Setup (Lines 543-548)
+```assembly
+@ assumes r0 = out 
+@         r12 = offsets
+@         r3-r10 = data pointers
+@         r11 = addr of twiddle 
+@         r2 & lr = temps
+```
+
+**Input Registers:**
+- `r0`: Output buffer base address
+- `r12`: Pointer to offset array (used to calculate output addresses)
+- `r3`: Data pointer 1 (input data)
+- `r4`: Data pointer 2 (input data)
+- `r5`: Data pointer 3 (input data)
+- `r6`: Data pointer 4 (input data)
+- `r7`: Data pointer 5 (input data)
+- `r8`: Data pointer 6 (input data)
+- `r9`: Data pointer 7 (input data)
+- `r10`: Data pointer 8 (input data)
+- `r11`: Pointer to twiddle factor table
+- `r2`, `lr`: Temporary registers (will be overwritten)
+
+## Instruction-by-Instruction Analysis
+
+### Phase 1: Initial Data Loading (Lines 557-561)
+
+```assembly
+557:  vld1.32  {q8},  [r5,  :128]!
+558:  vld1.32  {q10}, [r6,  :128]!
+559:  vld2.32  {q11}, [r4,  :128]!
+560:  vld2.32  {q13}, [r3,  :128]!
+561:  vld2.32  {q15}, [r10, :128]!
+```
+
+**Analysis:**
+- `vld1.32 {q8}, [r5, :128]!`: Loads 4 consecutive 32-bit floats (16 bytes) from the address in r5 into q8 (d16-d17). The `:128` qualifier states that the address is 128-bit aligned. The `!` advances r5 by 16 bytes.
+- `vld1.32 {q10}, [r6, :128]!`: Loads 4 consecutive floats from r6 into q10 (d20-d21).
+- `vld2.32 {q11}, [r4, :128]!`: **Deinterleaving load**. q11 is the register pair d22-d23, so this loads 4 floats (16 bytes, two complex values) from r4, putting even-indexed elements (0, 2) into d22 and odd-indexed elements (1, 3) into d23. For interleaved complex data, this separates real and imaginary parts. r4 advances by 16 bytes.
+- `vld2.32 {q13}, [r3, :128]!`: Deinterleaving load from r3 into q13 (d26-d27).
+- `vld2.32 {q15}, [r10, :128]!`: Deinterleaving load from r10 into q15 (d30-d31).
+
+**Data Organization After Loading:**
+- q8 (d16-d17): 4 consecutive values from r5
+- q10 (d20-d21): 4 consecutive values from r6
+- q11: d22 = 2 real parts, d23 = 2 imaginary parts from r4
+- q13: d26 = 2 real parts, d27 = 2 imaginary parts from r3
+- q15: d30 = 2 real parts, d31 = 2 imaginary parts from r10
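The difference between the two load forms can be sketched with a small Python model. This is only an illustration of the lane placement, assuming the memory behind each pointer holds interleaved complex floats; the helper names `vld1_32_q` and `vld2_32_q` are hypothetical and the sample values are made up.

```python
def vld1_32_q(mem):
    """Model vld1.32 {qN}: four consecutive floats, returned as (d_low, d_high)."""
    assert len(mem) == 4
    return mem[0:2], mem[2:4]


def vld2_32_q(mem):
    """Model vld2.32 {d_even, d_odd}: de-interleave four floats into two D registers."""
    assert len(mem) == 4
    return mem[0::2], mem[1::2]


# Two interleaved complex values, 1+2i and 3+4i (made-up sample data)
memory = [1.0, 2.0, 3.0, 4.0]

plain_lo, plain_hi = vld1_32_q(memory)   # each D register holds one complex value
re_lanes, im_lanes = vld2_32_q(memory)   # one D register of reals, one of imaginaries

print("vld1:", plain_lo, plain_hi)
print("vld2:", re_lanes, im_lanes)
```

Both forms consume the same 16 bytes; only the placement of the four floats in the two D registers differs.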
+
+### Phase 2: Register Reorganization (Lines 562-564)
+
+```assembly
+562:  vorr     d25, d17, d17
+563:  vorr     d24, d20, d20
+564:  vorr     d20, d16, d16
+```
+
+**Analysis:**
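+These are register copies, written as a bitwise OR of a register with itself:
+- `d25 = d17` (copy upper half of q8)
+- `d24 = d20` (copy lower half of q10)
+- `d20 = d16` (copy lower half of q8)
+
+After these copies, q12 (d24-d25) holds the lower half of q10 followed by the upper half of q8, and q10 (d20-d21) holds the lower half of q8 followed by its own original upper half. Line 563 has to read d20 before line 564 overwrites it.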
+
+### Phase 3: First Butterfly Computations (Lines 565-566)
+
+```assembly
+565:  vsub.f32 q9,  q13, q11
+566:  vadd.f32 q11, q13, q11
+```
+
+**Butterfly operation on complex data:**
+- `q9 = q13 - q11`: Difference (complex subtraction)
+- `q11 = q13 + q11`: Sum (complex addition)
+
+Since q11 and q13 were loaded with vld2, their real and imaginary parts are already separated:
+- d18 = real(q13) - real(q11)
+- d19 = imag(q13) - imag(q11)
+- d22 = real(q13) + real(q11)
+- d23 = imag(q13) + imag(q11)
+
+### Phase 4: Output Address Calculation (Lines 567-573)
+
+```assembly
+567:  ldr      r2,  [r12], #4
+568:  vtrn.32  d24, d25
+569:  ldr      lr,  [r12], #4
+570:  vtrn.32  d20, d21
+571:  add      r2,  r0,  r2, lsl #2
+572:  vsub.f32 q8,  q10, q12
+573:  add      lr,  r0,  lr, lsl #2
+```
+
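+**Analysis:**
+- Lines 567 and 569 load two offsets from r12, post-incrementing it by 4 each time. Lines 571 and 573 scale each offset by 4 (`lsl #2`) and add it to r0, giving the output addresses in r2 and lr.
+- `vtrn.32 dA, dB` is a 2x2 transpose: it swaps the upper lane of dA with the lower lane of dB. Applied to the copies from Phase 2, and assuming the r5/r6 data is interleaved complex, it leaves real parts in d20 and d24 and imaginary parts in d21 and d25.
+- The butterfly continues with `q8 = q10 - q12`.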
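A minimal Python sketch of the data movement in lines 562-564, 568 and 570 is given here, using symbolic labels instead of numbers. It assumes interleaved complex values `a0`, `a1` behind r5 and `b0`, `b1` behind r6 (these names are hypothetical), and `vtrn_32_d` is an illustrative helper rather than anything from the source.

```python
def vtrn_32_d(da, db):
    """Model vtrn.32 dA, dB on 2-lane registers: swap dA[1] with dB[0]."""
    return [da[0], db[0]], [da[1], db[1]]


# Hypothetical interleaved inputs: a0, a1 behind r5 and b0, b1 behind r6
d16, d17 = ["a0.re", "a0.im"], ["a1.re", "a1.im"]   # q8  after vld1 from r5
d20, d21 = ["b0.re", "b0.im"], ["b1.re", "b1.im"]   # q10 after vld1 from r6

# Lines 562-564: copies (d20 is read before it is overwritten)
d25 = d17
d24 = d20
d20 = d16

# Lines 568 and 570: 2x2 transposes
d24, d25 = vtrn_32_d(d24, d25)
d20, d21 = vtrn_32_d(d20, d21)

print("q10:", d20, d21)
print("q12:", d24, d25)
```

Under these assumptions q10 pairs `a0` with `b1` and q12 pairs `b0` with `a1`, so `q10 - q12` holds `a0 - b0` in lane 0 and `b1 - a1` in lane 1, each with its real part in the low D register and its imaginary part in the high one.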
+
+### Phase 5: Complete First Set of Butterflies (Lines 574-580)
+
+```assembly
+574:  vadd.f32 q10, q10, q12
+575:  vadd.f32 q0,  q11, q10
+576:  vadd.f32 d25, d19, d16
+577:  vsub.f32 d27, d19, d16
+578:  vsub.f32 q1,  q11, q10
+579:  vsub.f32 d24, d18, d17
+580:  vadd.f32 d26, d18, d17
+```
+
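+Complex butterfly results:
+- `q10 = q10 + q12`: Completes the first-stage butterfly (the sum that pairs with `q8 = q10 - q12`)
+- `q0 = q11 + q10`: Second stage butterfly sum
+- `q1 = q11 - q10`: Second stage butterfly difference
+- The four D-register operations pair the real half of q9 (d18) with the imaginary half of q8 (d17), and the imaginary half of q9 (d19) with the real half of q8 (d16). This is the pattern of a butterfly with a ±i rotation: q12 (d24-d25) = q9 + i·q8 and q13 (d26-d27) = q9 - i·q8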
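The D-register operations on lines 576, 577, 579 and 580 can be checked against ordinary complex arithmetic. The sketch below models a single lane, assuming d18/d16 hold real parts and d19/d17 hold imaginary parts; `butterfly_pm_i` is a hypothetical helper name and the operands are made-up values.

```python
def butterfly_pm_i(x, y):
    """Return (x + i*y, x - i*y) using only real additions and subtractions."""
    plus = complex(x.real - y.imag, x.imag + y.real)    # d24 = d18 - d17, d25 = d19 + d16
    minus = complex(x.real + y.imag, x.imag - y.real)   # d26 = d18 + d17, d27 = d19 - d16
    return plus, minus


x = complex(1.0, 2.0)   # stands in for one lane of q9 (d18 = re, d19 = im)
y = complex(3.0, 4.0)   # stands in for one lane of q8 (d16 = re, d17 = im)
plus, minus = butterfly_pm_i(x, y)

print(plus, minus)
```

Multiplying by ±i needs no multiplications at all, only a swap of real and imaginary parts with one sign change, which is why the listing uses plain `vadd`/`vsub` on mixed D registers here.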
+
+### Phase 6: Data Transposition and First Store (Lines 581-585)
+
+```assembly
+581:  vtrn.32  q0,  q12
+582:  vtrn.32  q1,  q13
+583:  vld1.32  {d24, d25}, [r11, :128]
+584:  vswp     d1, d2
+585:  vst1.32  {q0,  q1},  [r2, :128]!
+```
+
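+- The two `vtrn.32` instructions move lanes 0 and 2 of q12 and q13 into lanes 1 and 3 of q0 and q1 (the previous lanes 1 and 3 of q0 and q1 move into q12 and q13)
+- Line 583 reloads q12 with the twiddle factors from r11, so the values just transposed into q12 are not read again
+- `vswp d1, d2` exchanges the upper half of q0 with the lower half of q1
+- Line 585 stores q0 and q1 (32 bytes) at r2 and advances r2 by 32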
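To see which lanes reach the store on line 585, the transposes and the swap can be replayed on symbolic labels. This is a sketch of the lane movement only; `vtrn_32_q` and `vswp_halves` are hypothetical helper names, and the twiddle load on line 583 is left out because it does not touch q0 or q1.

```python
def vtrn_32_q(qa, qb):
    """Model vtrn.32 qA, qB: swap qA[1] with qB[0] and qA[3] with qB[2]."""
    return [qa[0], qb[0], qa[2], qb[2]], [qa[1], qb[1], qa[3], qb[3]]


def vswp_halves(qa, qb):
    """Model vswp of qA's upper D register with qB's lower D register."""
    return qa[:2] + qb[:2], qa[2:] + qb[2:]


# Symbolic lane labels: "q0[2]" means lane 2 of q0 before line 581
q0, q12 = [f"q0[{i}]" for i in range(4)], [f"q12[{i}]" for i in range(4)]
q1, q13 = [f"q1[{i}]" for i in range(4)], [f"q13[{i}]" for i in range(4)]

q0, q12 = vtrn_32_q(q0, q12)    # line 581
q1, q13 = vtrn_32_q(q1, q13)    # line 582
q0, q1 = vswp_halves(q0, q1)    # line 584 (vswp d1, d2)

print(q0 + q1)   # the eight floats written by the store on line 585
```

In this model only lanes 0 and 2 of the four source registers reach the stored data; lanes 1 and 3 end up in q12 and q13.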
+
+### Phase 7: Second Set of Data Loading (Lines 586-589)
+
+```assembly
+586:  vld2.32  {q0},  [r9, :128]!
+587:  vadd.f32 q1,  q0, q15
+588:  vld2.32  {q13}, [r8, :128]!
+589:  vld2.32  {q14}, [r7, :128]!
+```
+
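+Loads the remaining three inputs (r9, r8, r7) with deinterleaving loads. Line 587 is already the first butterfly sum on the new data (`q1 = q0 + q15`, pairing the r9 data with the r10 data loaded in Phase 1), scheduled between the loads.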
+
+### Phase 8: Second Set of Butterflies (Lines 590-602)
+
+```assembly
+590:  vsub.f32 q15, q0,  q15
+591:  vsub.f32 q0,  q14, q13
+592:  vadd.f32 q3,  q14, q13
+593:  vadd.f32 q2,  q3,  q1
+594:  vadd.f32 d29, d1,  d30
+595:  vsub.f32 d27, d1,  d30
+596:  vsub.f32 q3,  q3,  q1
+597:  vsub.f32 d28, d0,  d31
+598:  vadd.f32 d26, d0,  d31
+599:  vtrn.32  q2,  q14
+600:  vtrn.32  q3,  q13
+601:  vswp     d5, d6
+602:  vst1.32  {q2, q3}, [r2, :128]!
+```
+
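+This mirrors Phases 3 to 6 for the second set of data: first-stage sums and differences (q1, q15, q0, q3), a second-stage sum and difference (q2, q3), the mixed real/imaginary operations on d26-d29, then transposition and a second 32-byte store through r2.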
+
+### Phase 9: Twiddle Factor Multiplication (Lines 603-616)
+
+```assembly
+603:  vtrn.32  q11, q9
+604:  vtrn.32  q10, q8
+605:  vmul.f32 d20, d18, d25
+606:  vmul.f32 d22, d19, d24
+607:  vmul.f32 d21, d19, d25
+608:  vmul.f32 d18, d18, d24
+609:  vmul.f32 d19, d16, d25
+610:  vmul.f32 d30, d17, d24
+611:  vmul.f32 d23, d16, d24
+612:  vmul.f32 d24, d17, d25
+613:  vadd.f32 d17, d22, d20
+614:  vsub.f32 d16, d18, d21
+615:  vsub.f32 d21, d30, d19
+616:  vadd.f32 d20, d24, d23
+```
+
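+**Complex multiplication with twiddle factors:**
+- Complex multiplication follows (a + bi) × (c + di) = (ac - bd) + (ad + bc)i, which costs four multiplies plus one add and one subtract per product
+- With the twiddle factors in q12 read as d24 = real and d25 = imaginary, lines 605-608 and 613-614 compute (d18 + i·d19) × (d24 + i·d25) into q8 (d16 = real, d17 = imaginary)
+- Lines 609-612 and 615-616 compute (d16 + i·d17) × (d24 - i·d25) into q10 (d20 = real, d21 = imaginary), that is, a multiplication by the conjugate twiddle
+- Line 612 overwrites d24, which is safe because it is the last instruction that reads the twiddle's real part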
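The twelve instructions on lines 605-616 can be replayed one lane at a time and compared with Python's built-in complex arithmetic. The sketch below keeps the register names from the listing as local variables; the function name `twiddle_products` and the sample values (including the twiddle) are assumptions for illustration, not taken from the source.

```python
import math


def twiddle_products(x, y, w):
    """Replay lines 605-616 for one lane; returns the new (q8, q10) as complex values."""
    a, b = x.real, x.imag      # d18, d19 on entry
    p, q = y.real, y.imag      # d16, d17 on entry
    c, d = w.real, w.imag      # d24, d25 on entry
    d20 = a * d                # 605
    d22 = b * c                # 606
    d21 = b * d                # 607
    d18 = a * c                # 608
    d19 = p * d                # 609
    d30 = q * c                # 610
    d23 = p * c                # 611
    d24 = q * d                # 612
    d17 = d22 + d20            # 613
    d16 = d18 - d21            # 614
    d21 = d30 - d19            # 615
    d20 = d24 + d23            # 616
    return complex(d16, d17), complex(d20, d21)


x, y = complex(1.0, 2.0), complex(3.0, -1.0)
w = complex(math.cos(math.pi / 4), -math.sin(math.pi / 4))   # sample twiddle, not from the source
q8, q10 = twiddle_products(x, y, w)
print(q8, q10)
```

The first result matches `x * w` and the second matches `y * conj(w)`, so the two products use the same loaded twiddle with opposite signs on its imaginary part.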
+
+### Phase 10: Final Butterflies and Storage (Lines 617-627)
+
+```assembly
+617:  vadd.f32 q9,  q8,  q10
+618:  vsub.f32 q8,  q8,  q10
+619:  vadd.f32 q4,  q14, q9
+620:  vsub.f32 q6,  q14, q9
+621:  vadd.f32 d11, d27, d16
+622:  vsub.f32 d15, d27, d16
+623:  vsub.f32 d10, d26, d17
+624:  vadd.f32 d14, d26, d17
+625:  vswp     d9,  d10
+626:  vswp     d13, d14
+627:  vstmia   lr!, {q4-q7}
+```
+
+Final butterfly operations, followed by a single 64-byte store of q4-q7 to the output address in lr (which `vstmia lr!` then advances by 64).
+
+## Detailed Complex Arithmetic Analysis
+
+### Complex Number Representation
+- **Memory Layout**: Complex numbers are stored as interleaved pairs [Re₀, Im₀, Re₁, Im₁, ...]
+- **Register Layout After vld2** (each D register holds two floats):
+  - Even register (e.g., d22): [Re₀, Re₁]
+  - Odd register (e.g., d23): [Im₀, Im₁]
+
+### Butterfly Operation Pattern
+The function is described here as a radix-8 FFT butterfly: it combines data from 8 input pointers, two complex values per pointer, in layered butterfly stages:
+
+1. **First Layer Butterflies** (Lines 565-566, 572-574):
+   - Input pairs: (q13, q11) and (q10, q12)
+   - Outputs: Sum and difference of each pair
+
+2. **Second Layer Butterflies** (Lines 575-580):
+   - Combines results from first layer
+   - Creates 4-point butterfly structure
+
+3. **Twiddle Factor Application** (Lines 605-616):
+   - Complex multiplication with twiddle factors
+   - Implements: Output = Butterfly_Result × W^k
+
+### Data Flow Diagram
+```
+Input (2 complex values from each of r3-r10):
+    [r3] ─┐
+          ├─ Butterfly ─┐
+    [r4] ─┘             │
+                        ├─ Butterfly ─┐
+    [r5] ─┐             │             │
+          ├─ Butterfly ─┘             ├─ Twiddle ─→ Output
+    [r6] ─┘                           │
+                                      │
+    [r7] ─┐                           │
+          ├─ Butterfly ─┐             │
+    [r8] ─┘             │             │
+                        ├─ Butterfly ─┘
+    [r9] ─┐             │
+          ├─ Butterfly ─┘
+    [r10]─┘
+```
+
+## Memory Access Pattern
+
+### Input Pointers (r3-r10)
+- Each pointer accesses 16 bytes (2 complex numbers in interleaved format)
+- Post-increment updates move each pointer forward by 16 bytes
+- Total input: 8 × 2 = 16 complex numbers (128 bytes) per call
+
+### Output Storage
+- Two output locations calculated from offsets in r12
+- First output: 64 bytes stored through r2 (two 32-byte `vst1.32` stores, lines 585 and 602)
+- Second output: 64 bytes stored through lr (`vstmia` of q4-q7, line 627)
+- Output format maintains interleaved real/imaginary layout
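The byte counts can be tallied directly from the loads and stores in the listing. The short Python check below is only bookkeeping, assuming 4-byte floats and 8-byte complex values; the variable names are illustrative.

```python
FLOAT_BYTES = 4
COMPLEX_BYTES = 2 * FLOAT_BYTES

# Loads in the listing: one 16-byte vld1/vld2 per input pointer r3..r10
input_bytes = 8 * 16

# Stores in the listing: two 32-byte vst1 through r2, one vstmia of q4-q7 through lr
r2_bytes = 2 * 32
lr_bytes = 4 * 16
output_bytes = r2_bytes + lr_bytes

print(input_bytes // COMPLEX_BYTES, output_bytes // COMPLEX_BYTES)
```

Input and output volumes match at 128 bytes, or 16 complex values, per invocation.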
+
+### Twiddle Factors (r11)
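+- Loaded once per function call (line 583), without pointer writeback
+- Contains pre-computed sine/cosine values for FFT
+- Format: the Phase 9 multiplies use d24 as the real parts and d25 as the imaginary parts, which implies two real values followed by two imaginary values in memory rather than interleaved [cos, sin] pairs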
+
+## Performance Characteristics
+
+### Instruction-Level Parallelism
+- Multiple independent operations scheduled together
+- Address calculations interleaved with SIMD operations
+- Maximizes CPU pipeline utilization
+
+### Memory Access Optimization
+- 128-bit aligned loads/stores for optimal cache line usage
+- Deinterleaving loads (vld2) save separate deinterleaving instructions
+- Post-increment addressing reduces instruction count
+
+### Register Pressure Management
+- All 16 NEON quad registers (q0-q15) are utilized
+- No values are spilled to the stack anywhere in the listed code
+- Temporary values reuse registers as soon as possible
+
+## Register Usage Summary
+
+### NEON Registers Modified:
+- q0-q15: All used for computation and temporary storage
+- Specifically:
+  - q8-q10: Initial data and intermediate results
+  - q11, q13, q15: Complex data (real/imag separated)
+  - q12: Twiddle factors
+  - q0-q7: Final results
+
+### ARM Registers Modified:
+- r2: Temporary, used for output address calculation
+- lr: Temporary, used for output address calculation
+- r3-r10: Updated by post-increment addressing (moved forward by 16 bytes)
+- r12: Updated by post-increment (moved forward by 8 bytes)
+
+### Preserved Registers:
+- r0: Output base address (unchanged)
+- r11: Twiddle factor pointer (unchanged)
+
+## Data Flow Analysis
+
+1. **Input Format**: Complex numbers stored as interleaved real/imaginary pairs
+2. **Processing**: 
+   - Deinterleaving loads separate real/imaginary
+   - Butterfly operations on separated components
+   - Twiddle factor multiplication
+   - Re-interleaving for output
+3. **Output Format**: Processed complex numbers in interleaved format
+
+## Key Optimizations
+
+1. **Deinterleaving Loads**: vld2 instructions automatically separate real/imaginary parts
+2. **Parallel Processing**: Processes multiple complex numbers simultaneously
+3. **Efficient Address Updates**: Post-increment addressing modes
+4. **Register Reuse**: Careful management to minimize register pressure
+5. **Aligned Access**: :128 alignment hints for optimal memory access
+
+## Considerations for ARM64 Porting
+
+### Register Mapping
+```
+ARM32 → ARM64
+q0-q15 → v0-v31 (doubled register count)
+r0-r12,lr → x0-x30 (64-bit registers)
+d0-d31 → no direct equivalent: d(2n)/d(2n+1) become vn.d[0]/vn.d[1] when qn maps to vn
+```
+
+### Instruction Translation
+```
+ARM32                    → ARM64
+vld1.32 {q8}, [r5]!     → ld1 {v8.4s}, [x5], #16
+vld2.32 {q11}, [r4]!    → ld2 {v22.2s, v23.2s}, [x4], #16
+vadd.f32 q0, q1, q2     → fadd v0.4s, v1.4s, v2.4s
+vsub.f32 q0, q1, q2     → fsub v0.4s, v1.4s, v2.4s
+vmul.f32 d0, d1, d2     → fmul v0.2s, v1.2s, v2.2s
+vtrn.32 q0, q1          → trn1 + trn2 (two instructions, separate destinations)
+vstmia r0!, {q0-q3}     → st1 {v0.4s-v3.4s}, [x0], #64
+```
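The `vtrn.32` row is the least mechanical of these translations, because the ARM32 instruction rewrites both operands while `trn1` and `trn2` each produce one result. A small Python model of the lane semantics, with hypothetical helper names and made-up lane values, illustrates the correspondence:

```python
def vtrn_32_q(qa, qb):
    """ARM32 vtrn.32 qA, qB: in-place 2x2 transposes, returns the new (qA, qB)."""
    return [qa[0], qb[0], qa[2], qb[2]], [qa[1], qb[1], qa[3], qb[3]]


def trn1_4s(vn, vm):
    """AArch64 trn1 vd.4s, vn.4s, vm.4s: even-numbered lanes of vn and vm, interleaved."""
    return [vn[0], vm[0], vn[2], vm[2]]


def trn2_4s(vn, vm):
    """AArch64 trn2 vd.4s, vn.4s, vm.4s: odd-numbered lanes of vn and vm, interleaved."""
    return [vn[1], vm[1], vn[3], vm[3]]


a = [0.0, 1.0, 2.0, 3.0]
b = [10.0, 11.0, 12.0, 13.0]
new_a, new_b = vtrn_32_q(a, b)

print(new_a == trn1_4s(a, b), new_b == trn2_4s(a, b))
```

Both `trn1` and `trn2` must read the original operands, so a port cannot write the `trn1` result over either source before `trn2` has executed; it needs a scratch register or fresh destinations.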
+
+### Optimization Opportunities for ARM64
+
+1. **Additional Registers**: Use v16-v31 to reduce dependencies
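+2. **Larger Vectors**: Consider SVE if available; its vector length is implementation-defined (128 to 2048 bits), so the code should not assume a fixed 256-bit width
+3. **Fused Operations**: Use fmla/fmls for multiply-accumulate; results can differ slightly from separate multiply and add because the intermediate product is not rounded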
+4. **Better Scheduling**: More registers allow better instruction scheduling
+5. **Predication**: SVE provides predicated operations for conditional execution
+
+### Potential ARM64 Implementation Structure
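+```assembly
+// Structural sketch only: register numbers are illustrative, not a 1:1 port.
+// Each .4s register holds four lanes, versus two per D register in the ARM32 code.
+neon_oe_arm64:
+    // Load with deinterleaving: v0/v2 = real parts, v1/v3 = imaginary parts
+    ld2 {v0.4s, v1.4s}, [x4], #32
+    ld2 {v2.4s, v3.4s}, [x3], #32
+    // Twiddles, assumed here to be stored interleaved: v24 = real, v25 = imag
+    ld2 {v24.4s, v25.4s}, [x11]
+
+    // Butterfly operations on real and imaginary parts
+    fadd v4.4s, v2.4s, v0.4s        // sum.re
+    fadd v5.4s, v3.4s, v1.4s        // sum.im
+    fsub v6.4s, v2.4s, v0.4s        // diff.re
+    fsub v7.4s, v3.4s, v1.4s        // diff.im
+
+    // Twiddle multiplication (diff * W) using fmla/fmls
+    fmul v16.4s, v6.4s, v24.4s      // re  = diff.re * w.re
+    fmls v16.4s, v7.4s, v25.4s      // re -= diff.im * w.im
+    fmul v17.4s, v6.4s, v25.4s      // im  = diff.re * w.im
+    fmla v17.4s, v7.4s, v24.4s      // im += diff.im * w.re
+
+    // Store with interleaving
+    st2 {v4.4s, v5.4s}, [x2], #32
+    st2 {v16.4s, v17.4s}, [x2], #32
+    ret
+```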
+
+## Summary
+
+The `neon_oe` function is a highly optimized radix-8 FFT butterfly implementation that:
+- Processes 16 complex numbers (8 pointers × 2 values each) per invocation
+- Uses all available NEON registers efficiently
+- Implements complex arithmetic using deinterleaved data layout
+- Applies twiddle factors for FFT frequency domain transformation
+- Maintains excellent memory access patterns with aligned loads/stores
+
+The function demonstrates expert-level ARM32 NEON optimization techniques that should be carefully preserved when porting to ARM64, while taking advantage of the newer architecture's enhanced capabilities.