# Detailed Analysis of neon_x8_t Macro (ARM32 NEON)

## Overview
The `neon_x8_t` function is an ARM32 NEON-optimized implementation of an 8-point FFT butterfly computation with transposed output (the `_t` suffix indicates transpose). It performs the complex arithmetic with NEON SIMD instructions, operating on four complex values per vector operation.

## Function Signature and Parameters

```c
void neon_x8_t(float *data, size_t stride, float *LUT);
```

### Input Parameters (ARM calling convention):
- **r0**: Base pointer to input/output data array (float*)
- **r1**: Stride between data elements (size_t)
- **r2**: Pointer to Look-Up Table (LUT) containing twiddle factors (float*)
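
A minimal, hypothetical call-site sketch: the byte-based stride and the in-place operation are inferences from the pointer arithmetic analysed below, and `apply_x8_transpose_pass` is an illustrative name, not part of the library.

```c
#include <stddef.h>

/* Hypothetical call site: stride is treated as a byte offset between the
 * eight blocks and the transform is assumed to run in place, both inferred
 * from the pointer arithmetic below rather than from a documented contract. */
extern void neon_x8_t(float *data, size_t stride, float *LUT);

void apply_x8_transpose_pass(float *data, size_t stride_bytes, float *twiddles)
{
    neon_x8_t(data, stride_bytes, twiddles);
}
```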

## Register Initialization and Data Pointer Setup

### Data Pointer Calculation (lines 212-220):
```assembly
add r3, r0, #0 @ data0 = r0
add r5, r0, r1, lsl #1 @ data2 = r0 + 2*r1
add r4, r0, r1 @ data1 = r0 + r1
add r7, r5, r1, lsl #1 @ data4 = r0 + 4*r1
add r6, r5, r1 @ data3 = r0 + 3*r1
add r9, r7, r1, lsl #1 @ data6 = r0 + 6*r1
add r8, r7, r1 @ data5 = r0 + 5*r1
add r10, r9, r1 @ data7 = r0 + 7*r1
add r12, r2, #0 @ LUT pointer = r2
```

### Register Mapping:
- **r3**: Points to data[0]
- **r4**: Points to data[1]
- **r5**: Points to data[2]
- **r6**: Points to data[3]
- **r7**: Points to data[4]
- **r8**: Points to data[5]
- **r9**: Points to data[6]
- **r10**: Points to data[7]
- **r12**: LUT pointer for twiddle factors
- **r11**: Loop counter (negative, counts up to 0)
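
As a rough C model of this pointer setup (assuming, as the raw `add` arithmetic suggests, that the stride is a byte offset):

```c
#include <stddef.h>
#include <stdint.h>

/* C model of the pointer setup: eight data pointers spaced `stride` bytes
 * apart (r3..r10) plus a working copy of the LUT pointer (r12). */
void setup_pointers(float *data, size_t stride, float *LUT,
                    float *ptrs[8], float **lut)
{
    uint8_t *base = (uint8_t *)data;
    for (int i = 0; i < 8; i++) {
        ptrs[i] = (float *)(base + (size_t)i * stride); /* data[i] */
    }
    *lut = LUT; /* r12 starts as a copy of r2 */
}
```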

## Loop Structure

### Loop Initialization (lines 211, 222):
```assembly
mov r11, #0
sub r11, r11, r1, lsr #5 @ r11 = -(r1 >> 5) = -(stride/32)
```

The counter starts at `-(stride >> 5)` and is incremented once per pass, so the loop body executes `stride / 32` times.

### Loop Control (lines 227, 320):
```assembly
adds r11, r11, #1 @ Increment counter
...
bne 1b @ Branch if not equal (loop while r11 != 0)
```
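
In C terms, the loop control is roughly equivalent to the following sketch (body elided):

```c
#include <stddef.h>

/* C model of the loop control: the counter starts at -(stride / 32) and is
 * incremented toward zero, so the body runs stride / 32 times. */
void loop_model(size_t stride)
{
    for (ptrdiff_t k = -(ptrdiff_t)(stride >> 5); k != 0; k++) {
        /* one pass of the 8-point butterfly network per iteration */
    }
}
```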

## Detailed Butterfly Computation Analysis

### Stage 1: First Butterfly Layer (lines 224-261)

#### Load Operations:
```assembly
vld1.32 {q2, q3}, [r12, :128]! @ Load twiddle factors W0, W1
vld1.32 {q14, q15}, [r6, :128] @ Load data[3] complex pairs
vld1.32 {q10, q11}, [r5, :128] @ Load data[2] complex pairs
```

#### Complex Multiplication with Twiddle Factors:
The code performs complex multiplication: (a + bi) * (c + di) = (ac - bd) + (ad + bc)i

```assembly
vmul.f32 q12, q15, q2 @ q12 = data[3].imag * W0.real
vmul.f32 q8, q14, q3 @ q8 = data[3].real * W0.imag
vmul.f32 q13, q14, q2 @ q13 = data[3].real * W0.real
vmul.f32 q9, q10, q3 @ q9 = data[2].real * W1.imag
vmul.f32 q1, q10, q2 @ q1 = data[2].real * W1.real
vmul.f32 q0, q11, q2 @ q0 = data[2].imag * W1.real
vmul.f32 q14, q11, q3 @ q14 = data[2].imag * W1.imag
vmul.f32 q15, q15, q3 @ q15 = data[3].imag * W1.imag
```

#### Detailed Register Contents After Multiplication:
- **q12**: data[3].imag * W0.real (part of data[3] * W0)
- **q8**: data[3].real * W0.imag (part of data[3] * W0)
- **q13**: data[3].real * W0.real (part of data[3] * W0)
- **q15**: data[3].imag * W0.imag (part of data[3] * W0)
- **q1**: data[2].real * W1.real (part of data[2] * W1)
- **q0**: data[2].imag * W1.real (part of data[2] * W1)
- **q9**: data[2].real * W1.imag (part of data[2] * W1)
- **q14**: data[2].imag * W1.imag (part of data[2] * W1)

#### Butterfly Calculations:
```assembly
vsub.f32 q10, q12, q8 @ q10 = temp complex result
vadd.f32 q11, q0, q9 @ q11 = temp complex result
vadd.f32 q8, q15, q13 @ q8 = temp complex result
vsub.f32 q9, q1, q14 @ q9 = temp complex result
```
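
For reference, the basic building block behind these vector sequences, a complex twiddle multiply followed by an add/sub butterfly, looks like this in scalar C (variable names are illustrative; the exact operand pairing in each stage is as described above):

```c
#include <complex.h>

/* Scalar reference: t = x * w expanded with the (ac - bd) + (ad + bc)i
 * identity, followed by the add/sub butterfly that combines it with the
 * untwiddled input. The NEON code performs this on four values at once. */
static inline void butterfly(float complex even, float complex x, float complex w,
                             float complex *out0, float complex *out1)
{
    float tr = crealf(x) * crealf(w) - cimagf(x) * cimagf(w); /* ac - bd */
    float ti = crealf(x) * cimagf(w) + cimagf(x) * crealf(w); /* ad + bc */
    float complex t = tr + ti * I;

    *out0 = even + t;
    *out1 = even - t;
}
```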

### Stage 2: Second Butterfly Layer (lines 240-279)

This stage combines results from Stage 1 with data[0] and data[1]:

```assembly
vld1.32 {q12, q13}, [r4, :128] @ Load data[1]
vld1.32 {q8, q9}, [r3, :128] @ Load data[0]

@ Butterfly operations combining all inputs
vadd.f32 q11, q8, q15 @ q11 = data[0] + processed_data
vsub.f32 q8, q8, q15 @ q8 = data[0] - processed_data
...
```

### Stage 3: Third Butterfly Layer (lines 282-319)

This stage processes data[4-7] with similar butterfly operations:

```assembly
vld1.32 {q14, q15}, [r10, :128] @ Load data[7]
vld1.32 {q12, q13}, [r8, :128] @ Load data[5]
vld1.32 {q2, q3}, [r12, :128]! @ Load next twiddle factors
```

## Memory Operations and Data Layout

### Input Data Format:
- Data is stored as interleaved complex numbers: [real0, imag0, real1, imag1, ...]
- Each `vld1.32 {q, q}` loads 8 floats = 4 complex numbers
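
A small C sketch of the data layout the loads assume (the type name `cpx32f` is illustrative):

```c
/* Interleaved complex layout assumed by the loads: re/im pairs packed
 * contiguously, so one 32-byte {q, q} load covers four of these. */
typedef struct {
    float re;
    float im;
} cpx32f;

_Static_assert(sizeof(cpx32f) == 8, "8 bytes per complex value, 4 per {q, q} load");
```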

### Output Operations:
The function uses `vst2.32` (store 2-element structures) for transposed output:

```assembly
vst2.32 {q0, q1}, [r3, :128]! @ Store to data[0] with transpose
vst2.32 {q2, q3}, [r5, :128]! @ Store to data[2] with transpose
vst2.32 {q4, q5}, [r7, :128]! @ Store to data[4] with transpose
vst2.32 {q6, q7}, [r9, :128]! @ Store to data[6] with transpose
```

The `vst2.32` instruction interleaves corresponding lanes of its two source registers into 2-element structures in memory; with the register contents produced by this routine, that reordering yields the transposed output layout.
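
A plain-C model of the `vst2.32` element ordering (this mirrors the architectural definition of a 2-element-structure store, independent of this particular routine):

```c
/* Plain-C model of the vst2.32 element ordering: lane i of the first source
 * register and lane i of the second are written as an adjacent pair, i.e. the
 * two registers are interleaved into memory as 2-element structures. */
void vst2_model(const float qa[4], const float qb[4], float mem[8])
{
    for (int i = 0; i < 4; i++) {
        mem[2 * i]     = qa[i];
        mem[2 * i + 1] = qb[i];
    }
}
```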

## Register Usage Summary

### NEON Registers (128-bit vectors):
- **q0-q1**: Temporary calculation results, output data
- **q2-q3**: Twiddle factors from LUT
- **q4-q7**: Temporary results, output data
- **q8-q15**: Input data, intermediate calculations

### ARM Registers:
- **r0**: Base data pointer (preserved)
- **r1**: Stride value (preserved)
- **r2**: LUT pointer (preserved)
- **r3-r10**: Data pointers for 8-point access
- **r11**: Loop counter
- **r12**: Current LUT pointer (advances through loop)

## Key Optimization Techniques

1. **SIMD Parallelism**: Processes 4 complex numbers simultaneously using NEON vectors
2. **Register Blocking**: Maximizes register reuse to minimize memory access
3. **Transposed Output**: Uses `vst2.32` for efficient transpose during store
4. **Pointer Arithmetic**: Pre-calculates all data pointers for efficient addressing
5. **Loop Unrolling**: Processes complete 8-point FFTs in each iteration

## Real and Imaginary Part Storage

Throughout the computation:
- Even lanes (0, 2) typically hold real parts
- Odd lanes (1, 3) typically hold imaginary parts
- The `vst2.32` store then interleaves lanes from its two source registers, reordering these pairs into the transposed memory layout

## Summary

The `neon_x8_t` function implements a highly optimized 8-point FFT with transpose using ARM32 NEON SIMD instructions. It processes multiple 8-point FFTs in parallel, with each iteration handling 4 complex numbers per data point. The transpose operation is efficiently integrated into the store operations, making this suitable for multi-dimensional FFT implementations where a transpose is needed between passes.

## 8-Point FFT Butterfly Structure

```
Input: X[0], X[1], X[2], X[3], X[4], X[5], X[6], X[7]

Stage 1: Bit-reversal and initial butterflies
X[0] ----+--------->
|
X[4] ----+--W^0---->

X[2] ----+--------->
|
X[6] ----+--W^2---->

X[1] ----+--------->
|
X[5] ----+--W^1---->

X[3] ----+--------->
|
X[7] ----+--W^3---->

Stage 2: Second layer butterflies
Results combined with twiddle factors

Stage 3: Final butterflies
Output with transpose
```

## Detailed Execution Flow

### First Part of Loop Body (Processing Lower Half):

1. **Load twiddle factors and data[2,3]**
- q2, q3 = W[i], W[i+1] (twiddle factors)
- q14, q15 = data[3] (4 complex numbers)
- q10, q11 = data[2] (4 complex numbers)

2. **Complex multiplication data[2,3] * W**
- Each complex multiplication requires 4 real multiplications
- Results stored in temporary registers

3. **First set of butterflies**
- Combines multiplication results
- Prepares intermediate values

4. **Load data[0,1] and combine**
- q12, q13 = data[1]
- q8, q9 = data[0]
- Butterfly operations combine all values

5. **Store results with transpose**
- `vst2.32` interleaves the two source registers as 2-element structures, producing the transposed output ordering

### Second Part of Loop Body (Processing Upper Half):

The same sequence is applied to data[4]–data[7], using the next block of twiddle factors loaded from the LUT.

## Modified Registers Throughout Execution

### Registers Modified in Each Iteration:
1. **q0-q15**: All NEON registers are used and modified
2. **r11**: Loop counter incremented
3. **r12**: LUT pointer advanced by post-increment addressing
4. **r3-r10**: Data pointers advanced by the post-increment (`!`) writeback on the stores
5. **Memory pointed to by r3-r10**: Output data written back

### Preserved Registers:
- **r0**: Original data pointer
- **r1**: Stride value
- **r2**: Original LUT pointer

## Complex Number Layout in NEON Registers

Each q register (128-bit) holds 4 float values representing 2 complex numbers:
```
q register = [real0, imag0, real1, imag1]
```

When using vld1.32 {q14, q15}:
- q14 = [r0, i0, r1, i1] (first 2 complex numbers)
- q15 = [r2, i2, r3, i3] (next 2 complex numbers)
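
The same lane mapping expressed with NEON intrinsics (a sketch; `vld1q_f32` preserves memory order, matching the layout described above):

```c
#include <arm_neon.h>

/* A plain vld1 keeps memory order, so the lanes hold the interleaved
 * [re, im, re, im] stream exactly as laid out above. */
void load_two_quads(const float *p, float32x4_t *lo, float32x4_t *hi)
{
    *lo = vld1q_f32(p);     /* [r0, i0, r1, i1] */
    *hi = vld1q_f32(p + 4); /* [r2, i2, r3, i3] */
}
```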

## Twiddle Factor Application

The twiddle factors (W) are complex exponentials used in FFT:
- W = e^(-2πi*k/N) where N is FFT size, k is the index
- Stored as [real, imag] pairs in the LUT
- Applied through complex multiplication
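
An illustrative generator for such a table; only the [real, imag] pairing follows from the description above, while the ordering and packing actually expected by `neon_x8_t`'s LUT are assumptions.

```c
#include <math.h>

/* Illustrative twiddle-factor table: W_N^k = e^(-2*pi*i*k/N) stored as
 * adjacent [real, imag] pairs.  The ordering and packing expected by the
 * neon_x8_t LUT are not derived here and may differ. */
void make_twiddles(float *lut, int N)
{
    const double two_pi = 6.283185307179586;
    for (int k = 0; k < N / 2; k++) {
        double a = -two_pi * (double)k / (double)N;
        lut[2 * k]     = (float)cos(a); /* W.real */
        lut[2 * k + 1] = (float)sin(a); /* W.imag */
    }
}
```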

## Performance Characteristics

1. **Instruction Count**: ~96 instructions per loop iteration
2. **Memory Access**:
- 8 loads (data) + 3 loads (twiddle) = 11 loads
- 8 stores (all with transpose)
3. **Arithmetic Operations**: ~32 multiplications, ~32 additions/subtractions
4. **Data Processed**: 32 complex numbers per iteration (4 per point × 8 points)

## Key Differences from Non-Transposed Version

The main difference is in the store operations:
- Regular version: Uses vst1.32 (standard store)
- Transposed version: Uses vst2.32 (interleaving, 2-element-structure store)

This allows efficient matrix transpose during the FFT computation, eliminating the need for a separate transpose pass.
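
The contrast is easiest to see with the corresponding intrinsics (a sketch, not the library's own code):

```c
#include <arm_neon.h>

/* vst1q keeps register order in memory: a0 a1 a2 a3 b0 b1 b2 b3. */
void store_plain(float *out, float32x4_t a, float32x4_t b)
{
    vst1q_f32(out, a);
    vst1q_f32(out + 4, b);
}

/* vst2q interleaves the two registers: a0 b0 a1 b1 a2 b2 a3 b3, which is
 * the reordering that yields the transposed output in the _t variant. */
void store_interleaved(float *out, float32x4_t a, float32x4_t b)
{
    float32x4x2_t pair = { { a, b } };
    vst2q_f32(out, pair);
}
```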

## ARM64 Porting Considerations

When porting this code to ARM64 (AArch64), consider the following key differences:

### Register Changes:
1. **General Purpose Registers**: r0-r12 → x0-x12 (64-bit) or w0-w12 (32-bit)
2. **NEON Registers**: q0-q15 → v0-v31 (doubled register count)
3. **No separate VFP/NEON mode** in ARM64

### Instruction Changes:
1. **vld1.32** → **ld1** with appropriate type specifier
2. **vst2.32** → **st2** with appropriate type specifier
3. **vmul.f32** → **fmul** with vector specifier
4. **vadd.f32** → **fadd** with vector specifier
5. **vsub.f32** → **fsub** with vector specifier

### Addressing Mode Changes:
1. Post-increment syntax changes: `[r12, :128]!` → `[x12], #32` (the immediate is the 32-byte size of the two vectors; A64 `ld1`/`st1` have no equivalent of the `:128` alignment hint)
2. Shifted register syntax: `r1, lsl #1` → `x1, lsl #1` (similar but with x registers)

### Optimization Opportunities in ARM64:
1. **More registers**: Can potentially unroll further or reduce memory pressure
2. **Same 128-bit vectors, twice as many of them**: per-instruction parallelism is unchanged, but v0-v31 let more intermediate values stay in registers
3. **Better instruction scheduling**: More flexible pipeline in modern ARM64 cores
4. **FMA instructions**: Can use fused multiply-add for better performance
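
As a sketch of the FMA point above, assuming real and imaginary parts are held in separate (planar) vectors; these intrinsics map to `fmla`/`fmls` on AArch64:

```c
#include <arm_neon.h>

/* Complex multiply with fused multiply-add/subtract on planar vectors
 * (reals in ar/br, imaginaries in ai/bi):
 *   re = ar*br - ai*bi,  im = ar*bi + ai*br
 * Each result needs one mul plus one fused op instead of two muls and an
 * add/sub. */
void cmul_fma(float32x4_t ar, float32x4_t ai,
              float32x4_t br, float32x4_t bi,
              float32x4_t *re, float32x4_t *im)
{
    *re = vfmsq_f32(vmulq_f32(ar, br), ai, bi); /* ar*br - ai*bi */
    *im = vfmaq_f32(vmulq_f32(ar, bi), ai, br); /* ar*bi + ai*br */
}
```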

### Example ARM64 Translation Pattern:
```assembly
@ ARM32:
vld1.32 {q2, q3}, [r12, :128]!
vmul.f32 q12, q15, q2

// ARM64 (AArch64 assemblers use // for comments):
ld1 {v2.4s, v3.4s}, [x12], #32
fmul v12.4s, v15.4s, v2.4s
```

This analysis provides the foundation needed for an accurate port to ARM64 architecture.