yazon · yazon · Aug 15, 2025
diff --git a/neon_x8_analysis.md b/neon_x8_analysis.md
@@ -0,0 +1,224 @@
+# Detailed Analysis of ARM32 NEON `neon_x8` Function
+
+## Overview
+The `neon_x8` function implements an 8-point FFT (Fast Fourier Transform) using ARM32 NEON SIMD instructions. This is a highly optimized assembly routine that processes complex number arrays using vectorized operations.
+
+## Function Signature
+```assembly
+neon_x8:
+    @ r0 = data pointer (input/output buffer)
+    @ r1 = stride (distance between elements in bytes)
+    @ r2 = twiddle factor lookup table (LUT) pointer
+```
+
+## Register Usage and Setup
+
+### Initial Register Assignment (Lines 91-100)
+```assembly
+mov      r11, #0                    @ Loop counter initialization
+add      r3, r0, #0                 @ data0 = base address
+add      r5, r0, r1, lsl #1         @ data2 = base + 2*stride
+add      r4, r0, r1                 @ data1 = base + stride
+add      r7, r5, r1, lsl #1         @ data4 = base + 4*stride
+add      r6, r5, r1                 @ data3 = base + 3*stride
+add      r9, r7, r1, lsl #1         @ data6 = base + 6*stride
+add      r8, r7, r1                 @ data5 = base + 5*stride
+add      r10, r9, r1                @ data7 = base + 7*stride
+add      r12, r2, #0                @ LUT pointer
+```
+
+### Data Pointer Layout
+- **r3**: Points to data[0]
+- **r4**: Points to data[1]
+- **r5**: Points to data[2]
+- **r6**: Points to data[3]
+- **r7**: Points to data[4]
+- **r8**: Points to data[5]
+- **r9**: Points to data[6]
+- **r10**: Points to data[7]
+
+### Loop Control
+```assembly
+sub      r11, r11, r1, lsr #5       @ Initialize loop counter: -(stride >> 5)
+```
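+With `r11` cleared beforehand, this yields `-(stride / 32)`, a negative counter that increments toward zero. Each iteration advances every data pointer by 32 bytes (two `q` registers), so the counter reaches zero after one stride's worth of data per pointer.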
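+
+The counter logic can be sketched as below. This is a minimal sketch rather than the routine's actual code: the `adds` instruction and its placement are assumptions, since the excerpts in this document do not show which instruction sets the flags that the closing `bne` tests.
+
+```assembly
+    mov      r11, #0
+    sub      r11, r11, r1, lsr #5       @ r11 = -(stride / 32)
+1:
+    @ ... loop body: each pass consumes 32 bytes per data pointer ...
+    adds     r11, r11, #1               @ assumed: increment r11, Z set at zero
+    bne      1b                         @ repeat until r11 == 0
+```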
+
+## Main Loop Structure
+
+### Loop Label (Line 103)
+The main processing loop starts at label `1:` and continues until `r11` reaches zero.
+
+## First Butterfly Stage (Lines 104-127)
+
+### Data Loading and Twiddle Factor Application
+```assembly
+vld1.32  {q2,  q3},  [r12, :128]!  @ Load twiddle factors (W), auto-increment
+vld1.32  {q14, q15}, [r6, :128]    @ Load data[3] complex pairs
+vld1.32  {q10, q11}, [r5, :128]    @ Load data[2] complex pairs
+```
+
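+**NEON Register Layout:**
+- Each `q` register holds 4 float32 values
+- The element-wise multiplies below only form a complex product if the first register of each pair holds four real parts and the second holds four imaginary parts, e.g. `q14` = [Re0, Re1, Re2, Re3] and `q15` = [Im0, Im1, Im2, Im3]
+- Two `q` registers together hold 4 complex numbers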
+
+### Complex Multiplication for data[2] and data[3]
+```assembly
+@ Partial products for data[3] * conj(W) (signs are applied when the products are combined)
+vmul.f32 q12, q15, q2    @ q12 = data[3].im * W.re
+vmul.f32 q8,  q14, q3    @ q8  = data[3].re * W.im
+vmul.f32 q13, q14, q2    @ q13 = data[3].re * W.re
+vmul.f32 q15, q15, q3    @ q15 = data[3].im * W.im
+
+@ Partial products for data[2] * W
+vmul.f32 q9,  q10, q3    @ q9  = data[2].re * W.im
+vmul.f32 q1,  q10, q2    @ q1  = data[2].re * W.re
+vmul.f32 q0,  q11, q2    @ q0  = data[2].im * W.re
+vmul.f32 q14, q11, q3    @ q14 = data[2].im * W.im
+```
+
+### Butterfly Computations
+```assembly
+@ Combine the partial products into the twiddled values
+vsub.f32 q10, q12, q8     @ q10 = (d3.im * W.re) - (d3.re * W.im) = Im(d3 * conj(W))
+vadd.f32 q11, q0,  q9     @ q11 = (d2.im * W.re) + (d2.re * W.im) = Im(d2 * W)
+vadd.f32 q8,  q15, q13    @ q8  = (d3.im * W.im) + (d3.re * W.re) = Re(d3 * conj(W))
+vsub.f32 q9,  q1,  q14    @ q9  = (d2.re * W.re) - (d2.im * W.im) = Re(d2 * W)
+
+@ Load data[1]
+vld1.32  {q12, q13}, [r4, :128]
+
+@ Butterflies: crossing the Im and Re differences multiplies the difference by +/-i
+vsub.f32 q15, q11, q10    @ q15 = Im(d2 * W) - Im(d3 * conj(W))
+vsub.f32 q14, q9,  q8     @ q14 = Re(d2 * W) - Re(d3 * conj(W))
+vsub.f32 q4,  q12, q15    @ q4 = data[1].re - q15  (Re of data[1] + i*diff)
+vadd.f32 q6,  q12, q15    @ q6 = data[1].re + q15  (Re of data[1] - i*diff)
+vadd.f32 q5,  q13, q14    @ q5 = data[1].im + q14  (Im of data[1] + i*diff)
+vsub.f32 q7,  q13, q14    @ q7 = data[1].im - q14  (Im of data[1] - i*diff)
+```
+
+### Store First Results
+```assembly
+vst1.32  {q4,  q5},  [r4, :128]    @ Store to data[1]
+vst1.32  {q6,  q7},  [r6, :128]    @ Store to data[3]
+```
+
+## Second Butterfly Stage (Lines 128-174)
+
+### Load Second Set of Data
+```assembly
+vld1.32  {q14, q15}, [r9, :128]    @ Load data[6]
+vld1.32  {q12, q13}, [r7, :128]    @ Load data[4]
+vld1.32  {q2,  q3},  [r12, :128]!  @ Load next twiddle factors
+```
+
+### Complex Multiplication and Butterfly
+This stage follows the same pattern as the first, but operates on data[4], data[5], data[6], and data[7].
+
+```assembly
+@ Complex multiplications
+vmul.f32 q1,  q14, q2    @ data[6] * W
+vmul.f32 q0,  q14, q3
+vmul.f32 q14, q15, q3
+vmul.f32 q4,  q15, q2
+@ ... more multiplications for data[4] and data[5]
+
+@ Butterfly operations and combinations
+vadd.f32 q14, q14, q1
+vsub.f32 q13, q4,  q0
+@ ... more butterfly operations
+
+@ Load data[0]
+vld1.32  {q8,  q9},  [r3, :128]
+
+@ Final butterfly combinations
+vadd.f32 q11, q8,  q15   @ data[0] + result
+vsub.f32 q8,  q8,  q15   @ data[0] - result
+```
+
+## Third Butterfly Stage (Lines 175-199)
+
+### Final Stage Processing
+```assembly
+@ Load remaining data from updated locations
+vld1.32  {q8,  q9},  [r4, :128]    @ Reload data[1]
+vld1.32  {q10, q11}, [r6, :128]    @ Reload data[3]
+
+@ Final butterfly operations
+vadd.f32 q0,  q8,  q13    @ Final combinations
+vadd.f32 q1,  q9,  q12
+vsub.f32 q2,  q10, q15
+vadd.f32 q3,  q11, q14
+vsub.f32 q4,  q8,  q13
+vsub.f32 q5,  q9,  q12
+vadd.f32 q6,  q10, q15
+vsub.f32 q7,  q11, q14
+```
+
+### Store Final Results with Auto-increment
+```assembly
+vst1.32  {q0,  q1},  [r3, :128]!   @ Store to data[0], increment r3
+vst1.32  {q2,  q3},  [r5, :128]!   @ Store to data[2], increment r5
+vst1.32  {q4,  q5},  [r7, :128]!   @ Store to data[4], increment r7
+vst1.32  {q6,  q7},  [r9, :128]!   @ Store to data[6], increment r9
+@ NOTE: as excerpted, the stores below reuse q0-q7 unchanged, which would
+@ duplicate the values stored above. The arithmetic that refills these
+@ registers is presumably omitted here; check it against the source.
+vst1.32  {q0,  q1},  [r4, :128]!   @ Store to data[1], increment r4
+vst1.32  {q2,  q3},  [r6, :128]!   @ Store to data[3], increment r6
+vst1.32  {q4,  q5},  [r8, :128]!   @ Store to data[5], increment r8
+vst1.32  {q6,  q7},  [r10, :128]!  @ Store to data[7], increment r10
+```
+
+## Loop Control
+```assembly
+bne      1b                         @ Branch while Z is clear, i.e. until the flag-setting update of r11 yields zero
+bx       lr                         @ Return
+```
+
+## Key Observations for ARM64 Porting
+
+### 1. **Register Mapping**
+- ARM32 uses r0-r12 general purpose registers
+- ARM64 will use x0-x12 (64-bit) or w0-w12 (32-bit)
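+- NEON registers q0-q15 in ARM32 map to v0-v15 in ARM64; under AAPCS64 the low 64 bits of v8-v15 are callee-saved, so a port must preserve them or move those values to v16-v31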
+
+### 2. **Instruction Differences**
+- `vld1.32` → `ld1` with appropriate type specifier
+- `vst1.32` → `st1` with appropriate type specifier
+- `vmul.f32` → `fmul`
+- `vadd.f32` → `fadd`
+- `vsub.f32` → `fsub`
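+
+Applied to the data[3] multiply from the first stage, a one-to-one translation might look like the sketch below. It assumes the ARM32 register numbers carry over unchanged and that the low halves of v8-v15 are saved and restored elsewhere, as AAPCS64 requires. Neither assumption comes from the original routine.
+
+```assembly
+// Partial products, same operand order as the ARM32 listing
+fmul     v12.4s, v15.4s, v2.4s      // d3.im * W.re
+fmul     v8.4s,  v14.4s, v3.4s      // d3.re * W.im
+fmul     v13.4s, v14.4s, v2.4s      // d3.re * W.re
+fmul     v15.4s, v15.4s, v3.4s      // d3.im * W.im
+fsub     v10.4s, v12.4s, v8.4s      // Im(d3 * conj(W))
+fadd     v8.4s,  v15.4s, v13.4s     // Re(d3 * conj(W))
+```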
+
+### 3. **Addressing Modes**
+- ARM32: `[r0, :128]!` (128-bit aligned, post-increment)
+- ARM64: `[x0], #32` (post-increment by an explicit byte count; there is no `:128` alignment qualifier)
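+
+As a sketch of how the first-stage loads and stores could be written for AArch64, assuming the pointers simply move from r12, r6, and r4 to x12, x6, and x4 (an illustrative choice, not taken from an existing port):
+
+```assembly
+ld1      {v2.4s, v3.4s},   [x12], #32   // twiddles, then x12 += 32
+ld1      {v14.4s, v15.4s}, [x6]         // data[3], no writeback
+st1      {v4.4s, v5.4s},   [x4]         // store to data[1], no writeback
+```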
+
+### 4. **Complex Number Storage**
+- The multiply pattern pairs one register of real parts with one register of imaginary parts
+- Each pair of q registers holds 4 complex numbers (4 real parts, then 4 imaginary parts)
+- Butterfly operations preserve this split layout
+
+### 5. **Algorithm Structure**
+The function implements a radix-8 FFT using three stages of radix-2 butterflies:
+1. First stage: Process data[2,3] with data[0,1]
+2. Second stage: Process data[4,5,6,7] with previous results
+3. Third stage: Final combinations and output
+
+### 6. **Memory Access Pattern**
+- Strided access pattern based on input stride parameter
+- Auto-increment addressing used for efficiency
+- 128-bit aligned loads/stores for optimal performance
+
+### 7. **Twiddle Factor Application**
+- Twiddle factors are pre-computed and stored in LUT
+- Complex multiplication implemented using 4 real multiplications
+- Results combined using addition/subtraction
+
+## Critical Path Analysis
+The function has multiple data dependencies between stages, but within each stage, many operations can execute in parallel. The critical path involves:
+1. Load → Complex multiply → Butterfly → Store
+2. Inter-stage dependencies require careful scheduling
+
+## Performance Considerations
+- Uses 128-bit NEON registers efficiently
+- Minimizes memory accesses through register reuse
+- Processes multiple complex numbers simultaneously
+- Loop unrolling may help for larger transforms, but the gain should be measured on the target core
diff --git a/neon_x8_porting_summary.md b/neon_x8_porting_summary.md
@@ -0,0 +1,141 @@
+# ARM32 NEON neon_x8 to ARM64 Porting Summary
+
+## Executive Summary
+
+The `neon_x8` function implements an 8-point FFT using ARM32 NEON SIMD instructions. It is a highly optimized in-place algorithm that processes complex numbers held as register pairs, one register of real parts and one of imaginary parts, as the multiply pattern implies. The function uses a radix-8 decomposition implemented as three stages of radix-2 butterflies.
+
+## Key Algorithm Characteristics
+
+### 1. **FFT Type**
+- 8-point Decimation-In-Time (DIT) FFT
+- In-place computation
+- Complex-to-complex transform
+- Three butterfly stages
+
+### 2. **Data Layout**
+- Complex numbers are held as register pairs: one q register of 4 real parts and one of 4 imaginary parts (implied by the multiply pattern)
+- Each pair of NEON q registers therefore holds 4 complex numbers (8 floats)
+- Data accessed with stride to support larger transforms
+
+### 3. **Computational Pattern**
+```
+Stage 1: Butterflies on (data[2], data[3]) with (data[0], data[1])
+Stage 2: Butterflies on (data[4], data[5], data[6], data[7])
+Stage 3: Final combinations of all 8 points
+```
+
+## Register Allocation Strategy
+
+### NEON Registers (q0-q15)
+- **q0-q1**: Temporary computation registers
+- **q2-q3**: Twiddle factors from LUT
+- **q4-q7**: Butterfly results for storage
+- **q8-q15**: Data values and intermediate results
+
+### ARM Registers (r0-r12)
+- **r0-r2**: Function parameters (data, stride, LUT)
+- **r3-r10**: Pointers to 8 data elements
+- **r11**: Loop counter
+- **r12**: Current LUT pointer
+
+## Critical Implementation Details
+
+### 1. **Loop Structure**
+- Loop count = -(stride >> 5), increments to zero
+- Processes multiple 8-point FFTs in sequence
+- Auto-increment addressing for efficiency
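+
+For the ARM64 port, one possible loop skeleton is sketched below. Counting down from `stride >> 5` instead of up from a negative value is a free choice made here for illustration, and the register assignment (stride in x1, zero-extended, counter in x11) is an assumption.
+
+```assembly
+    lsr      x11, x1, #5                // iterations = stride / 32
+1:
+    // ... loop body: 32 bytes consumed per data pointer ...
+    subs     x11, x11, #1               // count down and set flags
+    b.ne     1b                         // repeat until the counter hits zero
+    ret
+```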
+
+### 2. **Complex Multiplication**
+```
+(a + bi) * (c + di) = (ac - bd) + (ad + bc)i
+```
+Implemented using 4 real multiplications per complex multiply
+
+### 3. **Memory Access Pattern**
+- Initial loads from all 8 locations
+- Intermediate stores after each butterfly stage
+- Pointer auto-increment prepares for next iteration
+
+### 4. **Twiddle Factor Loading**
+- Pre-computed twiddle factors in LUT
+- Loaded sequentially with auto-increment
+- Two q registers per twiddle set (4 complex values)
+
+## ARM64 Porting Considerations
+
+### 1. **Instruction Mapping**
+| ARM32 | ARM64 |
+|-------|-------|
+| `vld1.32 {q0,q1}, [r0, :128]!` | `ld1 {v0.4s, v1.4s}, [x0], #32` |
+| `vst1.32 {q0,q1}, [r0, :128]` | `st1 {v0.4s, v1.4s}, [x0]` |
+| `vmul.f32 q0, q1, q2` | `fmul v0.4s, v1.4s, v2.4s` |
+| `vadd.f32 q0, q1, q2` | `fadd v0.4s, v1.4s, v2.4s` |
+| `vsub.f32 q0, q1, q2` | `fsub v0.4s, v1.4s, v2.4s` |
+
+### 2. **Register Mapping**
+| ARM32 | ARM64 |
+|-------|-------|
+| r0-r12 | x0-x12 (64-bit) or w0-w12 (32-bit) |
+| q0-q15 | v0-v15 (128-bit vectors) |
+
+### 3. **Addressing Modes**
+- ARM64 `ld1`/`st1` support post-increment addressing
+- Syntax: `[x0], #32` for 32-byte post-increment
+- Alignment hints: drop `:128`; AArch64 `ld1`/`st1` have no alignment qualifier
+
+### 4. **Optimization Opportunities**
+- ARM64 has 32 NEON registers (v0-v31) vs 16 in ARM32
+- Can reduce register pressure and memory accesses
+- Potential for better instruction scheduling
+- Consider using paired loads (ldp) where beneficial
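+
+A paired-register variant of the first-stage memory accesses might look like the sketch below. The register choices mirror the ARM32 listing and are illustrative only. On little-endian targets `ldp`/`stp` of two `q` registers move the same 32 bytes as the two-register `ld1`/`st1` forms; whether they are faster depends on the core.
+
+```assembly
+ldp      q2,  q3,  [x12], #32           // twiddles, then x12 += 32
+ldp      q14, q15, [x6]                 // data[3], no writeback
+stp      q4,  q5,  [x4]                 // store to data[1], no writeback
+```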
+
+## Performance Critical Paths
+
+### 1. **Data Dependencies**
+```
+Load → Complex Multiply → Butterfly → Store
+   ↓
+Next stage butterflies depend on previous results
+```
+
+### 2. **Latency Hiding**
+- Early loads to hide memory latency
+- Interleaved arithmetic operations
+- Register reuse minimizes loads
+
+### 3. **Throughput Optimization**
+- SIMD processes 4 complex numbers from each of the 8 data pointers per loop iteration
+- Aligned 128-bit loads/stores
+- Minimal memory traffic through in-place operation
+
+## Implementation Strategy for ARM64
+
+### 1. **Direct Translation**
+- Start with 1:1 instruction mapping
+- Maintain same register allocation strategy
+- Preserve memory access pattern
+
+### 2. **ARM64-Specific Optimizations**
+- Utilize additional v16-v31 registers
+- Consider SVE/SVE2 for scalable vectors
+- Explore fused multiply-add (fmla) opportunities
+- Use paired load/store where beneficial
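+
+As a sketch of the fused multiply-add idea, the complex product `d * W` can be formed with two multiplies and two fused operations instead of four multiplies plus an add and a subtract. The register assignment is hypothetical: d in v10 (real parts) and v11 (imaginary parts), W in v2 and v3, results in the caller-saved v16 and v17.
+
+```assembly
+fmul     v16.4s, v10.4s, v2.4s          // re  = d.re * W.re
+fmls     v16.4s, v11.4s, v3.4s          // re -= d.im * W.im
+fmul     v17.4s, v11.4s, v2.4s          // im  = d.im * W.re
+fmla     v17.4s, v10.4s, v3.4s          // im += d.re * W.im
+```
+
+Because `fmla` and `fmls` round once rather than twice, results can differ in the last bit from the separate multiply and add sequence.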
+
+### 3. **Testing Considerations**
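+- Verify bit-exact results against the ARM32 version for the direct translation; expect last-bit differences once fused operations (`fmla`/`fmls`) are introduced, since they round once instead of twice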
+- Test with various stride values
+- Validate alignment requirements
+- Performance comparison on target hardware
+
+## Summary of Key Findings
+
+1. **Algorithm**: Three-stage radix-2 butterfly implementation of 8-point FFT
+2. **Data Format**: Complex numbers held as register pairs (4 real parts in one q register, 4 imaginary parts in the next)
+3. **Memory Pattern**: In-place with specific butterfly groupings
+4. **Twiddle Factors**: Pre-computed, loaded from LUT
+5. **Loop Structure**: Processes multiple 8-point FFTs based on stride
+6. **Critical Path**: Load → Multiply → Butterfly → Store chain
+7. **Register Usage**: Uses all sixteen NEON q registers (q0-q15)
+8. **Optimization**: Careful instruction scheduling for latency hiding
+
+This analysis provides the foundation for accurate ARM64 porting while maintaining the performance characteristics of the original ARM32 implementation.