180 changes: 180 additions & 0 deletions neon_oo_analysis.md
@@ -0,0 +1,180 @@
# Detailed Analysis of `neon_oo` Macro in ARM32 NEON Assembly

## Overview

The `neon_oo` macro is a core building block of the ARM32 NEON assembly implementation of FFT (Fast Fourier Transform) butterflies. Each invocation reads two complex numbers from each of 8 input pointers (16 complex values in total) and combines them with NEON SIMD add and subtract instructions.

## Register Setup and Input Parameters

### Input Register Mapping
- **r0**: Output buffer pointer (base address for storing results)
- **r1**: Pointer to FFT plan structure
- **r2**: Temporary register (used for calculated output addresses)
- **r3-r10**: Input data pointers to 8 different complex number arrays
- **r11**: Loop iteration counter (decremented each iteration)
- **r12**: Offset array pointer (contains output indices)
- **lr**: Temporary register (used for calculated output addresses)

### Register Setup Before Call
The registers r3-r10 are set up by a series of `ADDI` calls in the code generator. In the excerpt below every immediate is 0, so each call acts as a register move, and the three calls together swap r7 and r9 using r2 as scratch:
```c
ADDI(&fp, 2, 7, 0); // r2 = r7 (save old r7)
ADDI(&fp, 7, 9, 0); // r7 = r9
ADDI(&fp, 9, 2, 0); // r9 = r2 (old r7)
// ... similar pattern for r8, r10
```
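The swap can be checked with a small Python model. This is only a sketch: it assumes `ADDI(&fp, rd, rn, imm)` emits `rd = rn + imm`, which is inferred from the comments in the excerpt and not from the macro's definition. The `addi` helper and the register values are hypothetical.

```python
def addi(regs, rd, rn, imm):
    """Model of the emitted instruction, assuming it means rd = rn + imm."""
    regs[rd] = regs[rn] + imm

# Hypothetical pointer values, chosen only to make the swap visible.
regs = {2: 0, 7: 0x7000, 9: 0x9000}

addi(regs, 2, 7, 0)  # r2 = r7
addi(regs, 7, 9, 0)  # r7 = r9
addi(regs, 9, 2, 0)  # r9 = r2 (the old r7)
```

After the three calls r7 and r9 have exchanged values, and r2 is left holding the old r7.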

## Algorithm Analysis - Butterfly Computations

### Data Loading Pattern
The macro uses `vld2.32` instructions to load interleaved complex data:
```arm
vld2.32 {q8}, [r6, :128]! // Load 2 complex numbers from r6
vld2.32 {q9}, [r5, :128]! // Load 2 complex numbers from r5
vld2.32 {q10}, [r4, :128]! // Load 2 complex numbers from r4
vld2.32 {q13}, [r3, :128]! // Load 2 complex numbers from r3
```

The `vld2.32` instruction de-interleaves the data as it loads:
- First element of each pair → first D register of the destination (real parts)
- Second element of each pair → second D register of the destination (imaginary parts)
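A scalar Python model of a single `vld2.32 {qN}, [rX, :128]!` may make the de-interleaving easier to picture. It is a sketch only: the function name `vld2_32` is made up for illustration, memory is a plain list of floats, and the pointer is an index counted in floats, not bytes.

```python
def vld2_32(memory, index):
    """Model one two-register vld2.32 with writeback.

    Returns (first_d, second_d, new_index). Four consecutive floats are
    read and split into two 2-lane D registers.
    """
    a0, b0, a1, b1 = memory[index:index + 4]
    first_d = [a0, a1]    # first element of each pair: real parts
    second_d = [b0, b1]   # second element of each pair: imaginary parts
    return first_d, second_d, index + 4  # writeback: 4 floats = 16 bytes

# Interleaved complex data: 1+2i, 3+4i, 5+6i, 7+8i.
buf = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
re, im, nxt = vld2_32(buf, 0)
```

Here `re` holds the two real parts, `im` holds the two imaginary parts, and `nxt` points at the third complex number, mirroring the post-increment.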

### NEON Register Layout

#### Q Register to D Register Mapping
- **q8** = {d16, d17}
- **q9** = {d18, d19}
- **q10** = {d20, d21}
- **q11** = {d22, d23}
- **q12** = {d24, d25}
- **q13** = {d26, d27}

#### Complex Number Storage
For each Q register holding 2 complex numbers:
- **d[2n]** (even D register): Real parts [Re0, Re1]
- **d[2n+1]** (odd D register): Imaginary parts [Im0, Im1]

### Butterfly Computation Stages

#### Stage 1: First Level Butterflies (Lines 423-426)
```arm
vadd.f32 q11, q9, q8 // q11 = q9 + q8 (complex addition)
vsub.f32 q8, q9, q8 // q8 = q9 - q8 (complex subtraction)
vsub.f32 q9, q13, q10 // q9 = q13 - q10
vadd.f32 q12, q13, q10 // q12 = q13 + q10
```

#### Stage 2: Load More Data (Lines 428-429)
```arm
vld2.32 {q10}, [r7, :128]! // Load from r7
vld2.32 {q13}, [r9, :128]! // Load from r9
```

#### Stage 3: Second Level Butterflies (Lines 430-434)
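These operations work on individual D registers because each one pairs the real half of one value with the imaginary half of another, which folds a multiplication by ±i into the add or subtract:
```arm
vsub.f32 q2, q12, q11 // q2 = q12 - q11
vsub.f32 d7, d19, d16 // Im(q3) = Im(q9) - Re(q8)
vadd.f32 d3, d19, d16 // Im(q1) = Im(q9) + Re(q8)
vadd.f32 d6, d18, d17 // Re(q3) = Re(q9) + Im(q8)
vsub.f32 d2, d18, d17 // Re(q1) = Re(q9) - Im(q8)
```
With q1 = {d2, d3} and q3 = {d6, d7}, this yields q1 = q9 + i*q8 and q3 = q9 - i*q8.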
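The arithmetic of Stages 1 and 3, plus the `vadd.f32 q0, q12, q11` that is issued in Stage 5, can be cross-checked for a single SIMD lane using Python's built-in complex type. This is a sketch for illustration, not the author's code: `butterfly4` is a hypothetical name, and `x3`..`x6` stand for the values loaded through r3..r6.

```python
def butterfly4(x3, x4, x5, x6):
    """One lane of the first group of butterflies, following the register flow."""
    # Stage 1: whole-register adds and subtracts.
    q11 = x5 + x6          # vadd.f32 q11, q9, q8
    q8 = x5 - x6           # vsub.f32 q8, q9, q8
    q9 = x3 - x4           # vsub.f32 q9, q13, q10
    q12 = x3 + x4          # vadd.f32 q12, q13, q10
    # Whole-register outputs.
    q0 = q12 + q11         # vadd.f32 q0, q12, q11
    q2 = q12 - q11         # vsub.f32 q2, q12, q11
    # D-register operations mixing real and imaginary halves.
    d2 = q9.real - q8.imag  # vsub.f32 d2, d18, d17
    d3 = q9.imag + q8.real  # vadd.f32 d3, d19, d16
    d6 = q9.real + q8.imag  # vadd.f32 d6, d18, d17
    d7 = q9.imag - q8.real  # vsub.f32 d7, d19, d16
    q1 = complex(d2, d3)
    q3 = complex(d6, d7)
    return q0, q1, q2, q3

x3, x4, x5, x6 = 1 + 2j, 3 - 1j, 0.5 + 4j, -2 + 1j
q0, q1, q2, q3 = butterfly4(x3, x4, x5, x6)
```

For these inputs `q1` equals `(x3 - x4) + 1j * (x5 - x6)` and `q3` equals `(x3 - x4) - 1j * (x5 - x6)`, so the four D-register instructions implement the ±i rotation without any multiply.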

#### Stage 4: Load Final Data Set (Lines 435-436)
```arm
vld2.32 {q9}, [r8, :128]! // Load from r8
vld2.32 {q8}, [r10, :128]! // Load from r10
```

#### Stage 5: Third Level Butterflies (Lines 437-443)
```arm
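vadd.f32 q0, q12, q11 // q0 = q12 + q11 (first group, inputs from r3-r6)
vadd.f32 q11, q13, q8 // q11 = q13 + q8 (second group, inputs from r7-r10)
vadd.f32 q12, q10, q9 // q12 = q10 + q9
vsub.f32 q8, q13, q8 // q8 = q13 - q8
vsub.f32 q9, q10, q9 // q9 = q10 - q9
vsub.f32 q6, q12, q11 // q6 = q12 - q11
vadd.f32 q4, q12, q11 // q4 = q12 + q11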
```

#### Stage 6: Data Transposition (Lines 444, 452, 456-457)
```arm
vtrn.32 q0, q2 // Transpose for output format
vtrn.32 q1, q3
vtrn.32 q4, q6
vtrn.32 q5, q7
```
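The lane shuffle performed by `vtrn.32` on two Q registers can be modeled in a few lines of Python. The sketch below treats each register as a list of four 32-bit lanes; the function name `vtrn_32` and the string labels are illustrative only.

```python
def vtrn_32(qd, qm):
    """Model `vtrn.32 qd, qm`: transpose each 2x2 block of 32-bit lanes.

    Odd lanes of qd are exchanged with the even lanes of qm.
    """
    new_d = [qd[0], qm[0], qd[2], qm[2]]
    new_m = [qd[1], qm[1], qd[3], qm[3]]
    return new_d, new_m

# Labels instead of floats, so the movement of each lane is visible.
q0 = ["a0", "a1", "a2", "a3"]
q2 = ["b0", "b1", "b2", "b3"]
q0, q2 = vtrn_32(q0, q2)
```

After the call `q0` holds lane 0 and lane 2 of both inputs, and `q2` holds lane 1 and lane 3 of both inputs.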

### Memory Access Pattern

#### Input Access
- Uses post-increment addressing: `[rN, :128]!`
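- `:128` is an alignment qualifier: the address must be 128-bit (16-byte) aligned
- `!` writes the address back after the load, advancing the pointer by the 16 bytes transferred
- Each iteration issues 8 such loads of 2 complex numbers each, i.e. 16 complex numbers (32 float values)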

#### Output Access
1. Load offset indices from r12:
```arm
ldr r2, [r12], #4
ldr lr, [r12], #4
```
2. Calculate output addresses:
```arm
add r2, r0, r2, lsl #2 // r2 = r0 + (offset * 4)
add lr, r0, lr, lsl #2 // lr = r0 + (offset * 4)
```
3. Store results using `vst2.32` with interleaving:
```arm
vst2.32 {q0, q1}, [r2, :128]!
vst2.32 {q2, q3}, [lr, :128]!
vst2.32 {q4, q5}, [r2, :128]!
vst2.32 {q6, q7}, [lr, :128]!
```
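The three steps above can be sketched as a Python address calculation. The model assumes that the array behind r12 holds indices in units of 32-bit floats (which is what the `lsl #2` scaling suggests) and that each `vst2.32 {qA, qB}` writes four D registers, i.e. 32 bytes, before writing the address back. `store_addresses` and the sample values are hypothetical.

```python
STORE_BYTES = 32  # one vst2.32 {qA, qB} writes four D registers = 32 bytes

def store_addresses(r0, offsets, pos):
    """Return the byte addresses of the four stores and the new offset position.

    r0 is the output base address in bytes; offsets models the array read
    through r12; pos is the current index into that array.
    """
    r2 = r0 + (offsets[pos] << 2)      # ldr r2, [r12], #4 ; add r2, r0, r2, lsl #2
    lr = r0 + (offsets[pos + 1] << 2)  # ldr lr, [r12], #4 ; add lr, r0, lr, lsl #2
    # Stores alternate between r2 and lr; writeback advances each by 32 bytes.
    addrs = [r2, lr, r2 + STORE_BYTES, lr + STORE_BYTES]
    return addrs, pos + 2

addrs, pos = store_addresses(0x1000, [0, 16, 32, 48], 0)
```

With a 16-byte-aligned base and offsets that are multiples of 4, every store address stays 16-byte aligned, which is what the `:128` qualifier on the stores requires.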

## Register Usage Summary

### Modified NEON Registers
- **q0-q3**: Final butterfly results (first set)
- **q4-q7**: Final butterfly results (second set)
- **q8-q13**: Temporary computation registers

### Modified ARM Registers
- **r2, lr**: Temporary output address calculations
- **r3-r10**: Updated by post-increment loads
- **r11**: Decremented loop counter
- **r12**: Updated by post-increment offset loads

### Preserved Registers
- **r0**: Output base pointer (unchanged)
- **r1**: FFT plan pointer (unused in this macro)

## Performance Considerations

1. **Instruction Scheduling**: Loads are interleaved with computations to hide memory latency
2. **SIMD Utilization**: Each 128-bit Q-register operation handles 2 complex numbers (4 floats); one iteration covers 16 complex numbers (32 floats)
3. **Memory Alignment**: Uses 128-bit aligned accesses for optimal performance
4. **Loop Structure**: Simple decrement and branch pattern for minimal overhead

## Key Differences for ARM64 Porting

1. **Register Names**:
- Q registers → V registers (v0-v31)
- More registers available (32 vs 16)

2. **Instruction Differences**:
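- NEON mnemonics change: `vld2.32`/`vst2.32` become `ld2`/`st2`, `vadd.f32`/`vsub.f32` become `fadd`/`fsub`, and `vtrn.32` splits into `trn1`/`trn2`
- Armv8.3-A adds `FCADD` and `FCMLA` for complex add-with-rotation and complex multiply-accumulate on interleaved data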

3. **Addressing Modes**:
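- AArch64 `ld2`/`st2` support post-index writeback by an immediate or a register, but have no `:128` alignment qualifier
- 31 general-purpose registers leave more room for the eight input pointers, the offset pointer, and temporaries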

4. **Performance Opportunities**:
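- Advanced SIMD registers stay 128 bits wide; wider vectors require SVE, which is optional and has an implementation-defined length (128 to 2048 bits)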
- More registers reduce register pressure
- Better instruction scheduling opportunities

## Conclusion

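The `neon_oo` macro implements FFT butterfly operations with ARM32 NEON SIMD instructions. In the instructions shown, it works on two groups of four input streams (r3-r6 and r7-r10) that are never combined with each other, reading two complex values per stream per iteration. It manages register allocation carefully, uses aligned post-increment loads and offset-indexed stores, and interleaves loads with arithmetic to hide memory latency. Input is read in interleaved complex format, and the results are written to the positions given by the offset array for use by subsequent FFT stages.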