# Detailed Analysis of neon_x8_t Macro (ARM32 NEON)

## Overview
The `neon_x8_t` function is an ARM32 NEON-optimized implementation of an 8-point FFT butterfly computation with transposed output (the `_t` suffix indicates transpose). It performs the complex arithmetic with NEON SIMD instructions, operating on four complex values per vector operation.

## Function Signature and Parameters

```c
void neon_x8_t(float *data, size_t stride, float *LUT);
```

### Input Parameters (ARM calling convention):
- **r0**: Base pointer to input/output data array (float*)
- **r1**: Stride between data elements (size_t)
- **r2**: Pointer to Look-Up Table (LUT) containing twiddle factors (float*)
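
A minimal, hypothetical call-site sketch: the byte-based stride and the in-place operation are inferences from the pointer arithmetic analysed below, and `apply_x8_transpose_pass` is an illustrative name, not part of the library.

```c
#include <stddef.h>

/* Hypothetical call site: stride is treated as a byte offset between the
 * eight blocks and the transform is assumed to run in place, both inferred
 * from the pointer arithmetic below rather than from a documented contract. */
extern void neon_x8_t(float *data, size_t stride, float *LUT);

void apply_x8_transpose_pass(float *data, size_t stride_bytes, float *twiddles)
{
    neon_x8_t(data, stride_bytes, twiddles);
}
```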

## Register Initialization and Data Pointer Setup

### Data Pointer Calculation (lines 212-220):
```assembly
add r3, r0, #0 @ data0 = r0
add r5, r0, r1, lsl #1 @ data2 = r0 + 2*r1
add r4, r0, r1 @ data1 = r0 + r1
add r7, r5, r1, lsl #1 @ data4 = r0 + 4*r1
add r6, r5, r1 @ data3 = r0 + 3*r1
add r9, r7, r1, lsl #1 @ data6 = r0 + 6*r1
add r8, r7, r1 @ data5 = r0 + 5*r1
add r10, r9, r1 @ data7 = r0 + 7*r1
add r12, r2, #0 @ LUT pointer = r2
```

### Register Mapping:
- **r3**: Points to data[0]
- **r4**: Points to data[1]
- **r5**: Points to data[2]
- **r6**: Points to data[3]
- **r7**: Points to data[4]
- **r8**: Points to data[5]
- **r9**: Points to data[6]
- **r10**: Points to data[7]
- **r12**: LUT pointer for twiddle factors
- **r11**: Loop counter (negative, counts up to 0)
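
As a rough C model of this pointer setup (assuming, as the raw `add` arithmetic suggests, that the stride is a byte offset):

```c
#include <stddef.h>
#include <stdint.h>

/* C model of the pointer setup: eight data pointers spaced `stride` bytes
 * apart (r3..r10) plus a working copy of the LUT pointer (r12). */
void setup_pointers(float *data, size_t stride, float *LUT,
                    float *ptrs[8], float **lut)
{
    uint8_t *base = (uint8_t *)data;
    for (int i = 0; i < 8; i++) {
        ptrs[i] = (float *)(base + (size_t)i * stride); /* data[i] */
    }
    *lut = LUT; /* r12 starts as a copy of r2 */
}
```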

## Loop Structure

### Loop Initialization (lines 211, 222):
```assembly
mov r11, #0
sub r11, r11, r1, lsr #5 @ r11 = -(r1 >> 5) = -(stride/32)
```

The counter starts at `-(stride >> 5)` and is incremented once per pass, so the loop body executes `stride / 32` times.

### Loop Control (lines 227, 320):
```assembly
adds r11, r11, #1 @ Increment counter
...
bne 1b @ Branch if not equal (loop while r11 != 0)
```
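
In C terms, the loop control is roughly equivalent to the following sketch (body elided):

```c
#include <stddef.h>

/* C model of the loop control: the counter starts at -(stride / 32) and is
 * incremented toward zero, so the body runs stride / 32 times. */
void loop_model(size_t stride)
{
    for (ptrdiff_t k = -(ptrdiff_t)(stride >> 5); k != 0; k++) {
        /* one pass of the 8-point butterfly network per iteration */
    }
}
```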

## Detailed Butterfly Computation Analysis

### Stage 1: First Butterfly Layer (lines 224-261)

#### Load Operations:
```assembly
vld1.32 {q2, q3}, [r12, :128]! @ Load twiddle factors W0, W1
vld1.32 {q14, q15}, [r6, :128] @ Load data[3] complex pairs
vld1.32 {q10, q11}, [r5, :128] @ Load data[2] complex pairs
```

#### Complex Multiplication with Twiddle Factors:
The code performs complex multiplication: (a + bi) * (c + di) = (ac - bd) + (ad + bc)i

```assembly
vmul.f32 q12, q15, q2 @ q12 = data[3].imag * W0.real
vmul.f32 q8, q14, q3 @ q8 = data[3].real * W0.imag
vmul.f32 q13, q14, q2 @ q13 = data[3].real * W0.real
vmul.f32 q9, q10, q3 @ q9 = data[2].real * W1.imag
vmul.f32 q1, q10, q2 @ q1 = data[2].real * W1.real
vmul.f32 q0, q11, q2 @ q0 = data[2].imag * W1.real
vmul.f32 q14, q11, q3 @ q14 = data[2].imag * W1.imag
vmul.f32 q15, q15, q3 @ q15 = data[3].imag * W1.imag
```

#### Detailed Register Contents After Multiplication:
- **q12**: data[3].imag * W0.real (part of data[3] * W0)
- **q8**: data[3].real * W0.imag (part of data[3] * W0)
- **q13**: data[3].real * W0.real (part of data[3] * W0)
- **q15**: data[3].imag * W0.imag (part of data[3] * W0)
- **q1**: data[2].real * W1.real (part of data[2] * W1)
- **q0**: data[2].imag * W1.real (part of data[2] * W1)
- **q9**: data[2].real * W1.imag (part of data[2] * W1)
- **q14**: data[2].imag * W1.imag (part of data[2] * W1)

#### Butterfly Calculations:
```assembly
vsub.f32 q10, q12, q8 @ q10 = temp complex result
vadd.f32 q11, q0, q9 @ q11 = temp complex result
vadd.f32 q8, q15, q13 @ q8 = temp complex result
vsub.f32 q9, q1, q14 @ q9 = temp complex result
```
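
For reference, the basic building block behind these vector sequences, a complex twiddle multiply followed by an add/sub butterfly, looks like this in scalar C (variable names are illustrative; the exact operand pairing in each stage is as described above):

```c
#include <complex.h>

/* Scalar reference: t = x * w expanded with the (ac - bd) + (ad + bc)i
 * identity, followed by the add/sub butterfly that combines it with the
 * untwiddled input. The NEON code performs this on four values at once. */
static inline void butterfly(float complex even, float complex x, float complex w,
                             float complex *out0, float complex *out1)
{
    float tr = crealf(x) * crealf(w) - cimagf(x) * cimagf(w); /* ac - bd */
    float ti = crealf(x) * cimagf(w) + cimagf(x) * crealf(w); /* ad + bc */
    float complex t = tr + ti * I;

    *out0 = even + t;
    *out1 = even - t;
}
```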

### Stage 2: Second Butterfly Layer (lines 240-279)

This stage combines results from Stage 1 with data[0] and data[1]:

```assembly
vld1.32 {q12, q13}, [r4, :128] @ Load data[1]
vld1.32 {q8, q9}, [r3, :128] @ Load data[0]

@ Butterfly operations combining all inputs
vadd.f32 q11, q8, q15 @ q11 = data[0] + processed_data
vsub.f32 q8, q8, q15 @ q8 = data[0] - processed_data
...
```

### Stage 3: Third Butterfly Layer (lines 282-319)

This stage processes data[4-7] with similar butterfly operations:

```assembly
vld1.32 {q14, q15}, [r10, :128] @ Load data[7]
vld1.32 {q12, q13}, [r8, :128] @ Load data[5]
vld1.32 {q2, q3}, [r12, :128]! @ Load next twiddle factors
```

## Memory Operations and Data Layout

### Input Data Format:
- Data is stored as interleaved complex numbers: [real0, imag0, real1, imag1, ...]
- Each `vld1.32 {q, q}` loads 8 floats = 4 complex numbers
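
A small C sketch of the data layout the loads assume (the type name `cpx32f` is illustrative):

```c
/* Interleaved complex layout assumed by the loads: re/im pairs packed
 * contiguously, so one 32-byte {q, q} load covers four of these. */
typedef struct {
    float re;
    float im;
} cpx32f;

_Static_assert(sizeof(cpx32f) == 8, "8 bytes per complex value, 4 per {q, q} load");
```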

### Output Operations:
The function uses `vst2.32` (store 2-element structures) for transposed output:

```assembly
vst2.32 {q0, q1}, [r3, :128]! @ Store to data[0] with transpose
vst2.32 {q2, q3}, [r5, :128]! @ Store to data[2] with transpose
vst2.32 {q4, q5}, [r7, :128]! @ Store to data[4] with transpose
vst2.32 {q6, q7}, [r9, :128]! @ Store to data[6] with transpose
```

The `vst2.32` instruction interleaves corresponding lanes of its two source registers into 2-element structures in memory; with the register contents produced by this routine, that reordering yields the transposed output layout.
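
A plain-C model of the `vst2.32` element ordering (this mirrors the architectural definition of a 2-element-structure store, independent of this particular routine):

```c
/* Plain-C model of the vst2.32 element ordering: lane i of the first source
 * register and lane i of the second are written as an adjacent pair, i.e. the
 * two registers are interleaved into memory as 2-element structures. */
void vst2_model(const float qa[4], const float qb[4], float mem[8])
{
    for (int i = 0; i < 4; i++) {
        mem[2 * i]     = qa[i];
        mem[2 * i + 1] = qb[i];
    }
}
```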

## Register Usage Summary

### NEON Registers (128-bit vectors):
- **q0-q1**: Temporary calculation results, output data
- **q2-q3**: Twiddle factors from LUT
- **q4-q7**: Temporary results, output data
- **q8-q15**: Input data, intermediate calculations

### ARM Registers:
- **r0**: Base data pointer (preserved)
- **r1**: Stride value (preserved)
- **r2**: LUT pointer (preserved)
- **r3-r10**: Data pointers for 8-point access
- **r11**: Loop counter
- **r12**: Current LUT pointer (advances through loop)

## Key Optimization Techniques

1. **SIMD Parallelism**: Processes 4 complex numbers simultaneously using NEON vectors
2. **Register Blocking**: Maximizes register reuse to minimize memory access
3. **Transposed Output**: Uses `vst2.32` for efficient transpose during store
4. **Pointer Arithmetic**: Pre-calculates all data pointers for efficient addressing
5. **Loop Unrolling**: Processes complete 8-point FFTs in each iteration

## Real and Imaginary Part Storage

Throughout the computation:
- Even lanes (0, 2) typically hold real parts
- Odd lanes (1, 3) typically hold imaginary parts
- The `vst2.32` store then interleaves lanes from its two source registers, reordering these pairs into the transposed memory layout

## Summary

The `neon_x8_t` function implements a highly optimized 8-point FFT with transpose using ARM32 NEON SIMD instructions. It processes multiple 8-point FFTs in parallel, with each iteration handling 4 complex numbers per data point. The transpose operation is efficiently integrated into the store operations, making this suitable for multi-dimensional FFT implementations where a transpose is needed between passes.

## 8-Point FFT Butterfly Structure

```
Input: X[0], X[1], X[2], X[3], X[4], X[5], X[6], X[7]

Stage 1: Bit-reversal and initial butterflies
X[0] ----+--------->
|
X[4] ----+--W^0---->

X[2] ----+--------->
|
X[6] ----+--W^2---->

X[1] ----+--------->
|
X[5] ----+--W^1---->

X[3] ----+--------->
|
X[7] ----+--W^3---->

Stage 2: Second layer butterflies
Results combined with twiddle factors

Stage 3: Final butterflies
Output with transpose
```

## Detailed Execution Flow

### First Part of Loop Body (Processing Lower Half):

1. **Load twiddle factors and data[2,3]**
- q2, q3 = W[i], W[i+1] (twiddle factors)
- q14, q15 = data[3] (4 complex numbers)
- q10, q11 = data[2] (4 complex numbers)

2. **Complex multiplication data[2,3] * W**
- Each complex multiplication requires 4 real multiplications
- Results stored in temporary registers

3. **First set of butterflies**
- Combines multiplication results
- Prepares intermediate values

4. **Load data[0,1] and combine**
- q12, q13 = data[1]
- q8, q9 = data[0]
- Butterfly operations combine all values

5. **Store results with transpose**
- `vst2.32` interleaves the two source registers as 2-element structures, producing the transposed output ordering

### Second Part of Loop Body (Processing Upper Half):

The same sequence is applied to data[4]–data[7], using the next block of twiddle factors loaded from the LUT.

## Modified Registers Throughout Execution

### Registers Modified in Each Iteration:
1. **q0-q15**: All NEON registers are used and modified
2. **r11**: Loop counter incremented
3. **r12**: LUT pointer advanced by post-increment addressing
4. **r3-r10**: Data pointers advanced by the post-increment (`!`) writeback on the stores
5. **Memory pointed to by r3-r10**: Output data written back

### Preserved Registers:
- **r0**: Original data pointer
- **r1**: Stride value
- **r2**: Original LUT pointer

## Complex Number Layout in NEON Registers

Each q register (128-bit) holds 4 float values representing 2 complex numbers:
```
q register = [real0, imag0, real1, imag1]
```

When using vld1.32 {q14, q15}:
- q14 = [r0, i0, r1, i1] (first 2 complex numbers)
- q15 = [r2, i2, r3, i3] (next 2 complex numbers)
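
The same lane mapping expressed with NEON intrinsics (a sketch; `vld1q_f32` preserves memory order, matching the layout described above):

```c
#include <arm_neon.h>

/* A plain vld1 keeps memory order, so the lanes hold the interleaved
 * [re, im, re, im] stream exactly as laid out above. */
void load_two_quads(const float *p, float32x4_t *lo, float32x4_t *hi)
{
    *lo = vld1q_f32(p);     /* [r0, i0, r1, i1] */
    *hi = vld1q_f32(p + 4); /* [r2, i2, r3, i3] */
}
```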

## Twiddle Factor Application

The twiddle factors (W) are complex exponentials used in FFT:
- W = e^(-2πi*k/N) where N is FFT size, k is the index
- Stored as [real, imag] pairs in the LUT
- Applied through complex multiplication
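
An illustrative generator for such a table; only the [real, imag] pairing follows from the description above, while the ordering and packing actually expected by `neon_x8_t`'s LUT are assumptions.

```c
#include <math.h>

/* Illustrative twiddle-factor table: W_N^k = e^(-2*pi*i*k/N) stored as
 * adjacent [real, imag] pairs.  The ordering and packing expected by the
 * neon_x8_t LUT are not derived here and may differ. */
void make_twiddles(float *lut, int N)
{
    const double two_pi = 6.283185307179586;
    for (int k = 0; k < N / 2; k++) {
        double a = -two_pi * (double)k / (double)N;
        lut[2 * k]     = (float)cos(a); /* W.real */
        lut[2 * k + 1] = (float)sin(a); /* W.imag */
    }
}
```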

## Performance Characteristics

1. **Instruction Count**: ~96 instructions per loop iteration
2. **Memory Access**:
- 8 loads (data) + 3 loads (twiddle) = 11 loads
- 8 stores (all with transpose)
3. **Arithmetic Operations**: ~32 multiplications, ~32 additions/subtractions
4. **Data Processed**: 32 complex numbers per iteration (4 per point × 8 points)

## Key Differences from Non-Transposed Version

The main difference is in the store operations:
- Regular version: Uses vst1.32 (standard store)
- Transposed version: Uses vst2.32 (interleaving, 2-element-structure store)

This allows efficient matrix transpose during the FFT computation, eliminating the need for a separate transpose pass.
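
The contrast is easiest to see with the corresponding intrinsics (a sketch, not the library's own code):

```c
#include <arm_neon.h>

/* vst1q keeps register order in memory: a0 a1 a2 a3 b0 b1 b2 b3. */
void store_plain(float *out, float32x4_t a, float32x4_t b)
{
    vst1q_f32(out, a);
    vst1q_f32(out + 4, b);
}

/* vst2q interleaves the two registers: a0 b0 a1 b1 a2 b2 a3 b3, which is
 * the reordering that yields the transposed output in the _t variant. */
void store_interleaved(float *out, float32x4_t a, float32x4_t b)
{
    float32x4x2_t pair = { { a, b } };
    vst2q_f32(out, pair);
}
```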

## ARM64 Porting Considerations

When porting this code to ARM64 (AArch64), consider the following key differences:

### Register Changes:
1. **General Purpose Registers**: r0-r12 → x0-x12 (64-bit) or w0-w12 (32-bit)
2. **NEON Registers**: q0-q15 → v0-v31 (doubled register count)
3. **No separate VFP/NEON mode** in ARM64

### Instruction Changes:
1. **vld1.32** → **ld1** with appropriate type specifier
2. **vst2.32** → **st2** with appropriate type specifier
3. **vmul.f32** → **fmul** with vector specifier
4. **vadd.f32** → **fadd** with vector specifier
5. **vsub.f32** → **fsub** with vector specifier

### Addressing Mode Changes:
1. Post-increment syntax changes: `[r12, :128]!` → `[x12], #32` (the immediate is the 32-byte size of the two vectors; A64 `ld1`/`st1` have no equivalent of the `:128` alignment hint)
2. Shifted register syntax: `r1, lsl #1` → `x1, lsl #1` (similar but with x registers)

### Optimization Opportunities in ARM64:
1. **More registers**: Can potentially unroll further or reduce memory pressure
2. **Same 128-bit vectors, twice as many of them**: per-instruction parallelism is unchanged, but v0-v31 let more intermediate values stay in registers
3. **Better instruction scheduling**: More flexible pipeline in modern ARM64 cores
4. **FMA instructions**: Can use fused multiply-add for better performance
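
As a sketch of the FMA point above, assuming real and imaginary parts are held in separate (planar) vectors; these intrinsics map to `fmla`/`fmls` on AArch64:

```c
#include <arm_neon.h>

/* Complex multiply with fused multiply-add/subtract on planar vectors
 * (reals in ar/br, imaginaries in ai/bi):
 *   re = ar*br - ai*bi,  im = ar*bi + ai*br
 * Each result needs one mul plus one fused op instead of two muls and an
 * add/sub. */
void cmul_fma(float32x4_t ar, float32x4_t ai,
              float32x4_t br, float32x4_t bi,
              float32x4_t *re, float32x4_t *im)
{
    *re = vfmsq_f32(vmulq_f32(ar, br), ai, bi); /* ar*br - ai*bi */
    *im = vfmaq_f32(vmulq_f32(ar, bi), ai, br); /* ar*bi + ai*br */
}
```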

### Example ARM64 Translation Pattern:
```assembly
@ ARM32:
vld1.32 {q2, q3}, [r12, :128]!
vmul.f32 q12, q15, q2

// ARM64 (AArch64 assemblers use // for comments):
ld1 {v2.4s, v3.4s}, [x12], #32
fmul v12.4s, v15.4s, v2.4s
```

This analysis provides the foundation needed for an accurate port to ARM64 architecture.