224 changes: 224 additions & 0 deletions neon_x8_analysis.md
@@ -0,0 +1,224 @@
# Detailed Analysis of ARM32 NEON `neon_x8` Function

## Overview
The `neon_x8` function implements an 8-point FFT (Fast Fourier Transform) using ARM32 NEON SIMD instructions. This is a highly optimized assembly routine that processes complex number arrays using vectorized operations.

## Function Signature
```assembly
neon_x8:
@ r0 = data pointer (input/output buffer)
@ r1 = stride (distance between elements in bytes)
@ r2 = twiddle factor lookup table (LUT) pointer
```

## Register Usage and Setup

### Initial Register Assignment (Lines 91-100)
```assembly
mov r11, #0 @ Loop counter initialization
add r3, r0, #0 @ data0 = base address
add r5, r0, r1, lsl #1 @ data2 = base + 2*stride
add r4, r0, r1 @ data1 = base + stride
add r7, r5, r1, lsl #1 @ data4 = base + 4*stride
add r6, r5, r1 @ data3 = base + 3*stride
add r9, r7, r1, lsl #1 @ data6 = base + 6*stride
add r8, r7, r1 @ data5 = base + 5*stride
add r10, r9, r1 @ data7 = base + 7*stride
add r12, r2, #0 @ LUT pointer
```
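The pointer arithmetic above can be modeled in a few lines of Python. This is only a sketch for checking the offsets: the function name `data_pointers` and the example base address and stride are hypothetical and do not come from the assembly.

```python
def data_pointers(base, stride):
    """Return the eight addresses held in r3..r10 after the setup block.

    `base` models r0 and `stride` models r1 (in bytes). Both names are
    illustrative only.
    """
    r3 = base                    # add r3, r0, #0
    r5 = base + (stride << 1)    # add r5, r0, r1, lsl #1
    r4 = base + stride           # add r4, r0, r1
    r7 = r5 + (stride << 1)      # add r7, r5, r1, lsl #1
    r6 = r5 + stride             # add r6, r5, r1
    r9 = r7 + (stride << 1)      # add r9, r7, r1, lsl #1
    r8 = r7 + stride             # add r8, r7, r1
    r10 = r9 + stride            # add r10, r9, r1
    return [r3, r4, r5, r6, r7, r8, r9, r10]


pointers = data_pointers(base=0x1000, stride=64)
print([hex(p) for p in pointers])
```

Each pointer ends up exactly one stride after the previous one, which matches the `data[k] = base + k*stride` comments.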

### Data Pointer Layout
- **r3**: Points to data[0]
- **r4**: Points to data[1]
- **r5**: Points to data[2]
- **r6**: Points to data[3]
- **r7**: Points to data[4]
- **r8**: Points to data[5]
- **r9**: Points to data[6]
- **r10**: Points to data[7]

### Loop Control
```assembly
sub r11, r11, r1, lsr #5 @ Initialize loop counter: -(stride >> 5)
```
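This initializes `r11` to `-(stride >> 5)`, a negative counter that is incremented toward zero. Because every data pointer advances by 32 bytes per iteration (two `q` registers, see the post-increment stores at the end of the loop), the loop body runs `stride / 32` times.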
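The counter behaviour can be sketched as follows. The excerpts do not show where `r11` is incremented, so the once-per-iteration increment in this model is an assumption, as is the requirement that the stride be at least 32 bytes; `loop_iterations` is a hypothetical name.

```python
def loop_iterations(stride):
    """Count how many times the loop body would run for a byte stride."""
    if stride < 32:
        # With stride >> 5 == 0 the counter would start at zero and the
        # do-while style loop would not terminate as intended.
        raise ValueError("model assumes stride >= 32 bytes")
    r11 = 0
    r11 = r11 - (stride >> 5)    # sub r11, r11, r1, lsr #5
    iterations = 0
    while True:
        iterations += 1          # one pass through the loop body
        r11 += 1                 # assumed flag-setting increment (not in the excerpt)
        if r11 == 0:             # bne 1b falls through once Z is set
            break
    return iterations


print(loop_iterations(256))
```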

## Main Loop Structure

### Loop Label (Line 103)
The main processing loop starts at label `1:` and continues until `r11` reaches zero.

## First Butterfly Stage (Lines 104-127)

### Data Loading and Twiddle Factor Application
```assembly
vld1.32 {q2, q3}, [r12, :128]! @ Load twiddle factors (W), auto-increment
vld1.32 {q14, q15}, [r6, :128] @ Load data[3] complex pairs
vld1.32 {q10, q11}, [r5, :128] @ Load data[2] complex pairs
```

**NEON Register Layout:**
- Each `q` register holds 4 float32 values
- Complex numbers are stored as [Re0, Im0, Re1, Im1]
- Two `q` registers together hold 4 complex numbers

### Complex Multiplication for data[2] and data[3]
```assembly
@ Four partial products for data[3] and W (combined later as data[3] * conj(W))
vmul.f32 q12, q15, q2 @ q12 = data[3].im * W.re
vmul.f32 q8, q14, q3 @ q8 = data[3].re * W.im
vmul.f32 q13, q14, q2 @ q13 = data[3].re * W.re
vmul.f32 q15, q15, q3 @ q15 = data[3].im * W.im

@ Four partial products for data[2] and W (combined later as data[2] * W)
vmul.f32 q9, q10, q3 @ q9 = data[2].re * W.im
vmul.f32 q1, q10, q2 @ q1 = data[2].re * W.re
vmul.f32 q0, q11, q2 @ q0 = data[2].im * W.re
vmul.f32 q14, q11, q3 @ q14 = data[2].im * W.im
```

### Butterfly Computations
```assembly
@ First set of butterflies
vsub.f32 q10, q12, q8 @ q10 = (d3.im * W.re) - (d3.re * W.im) = Im part
vadd.f32 q11, q0, q9 @ q11 = (d2.im * W.re) + (d2.re * W.im) = Im part
vadd.f32 q8, q15, q13 @ q8 = (d3.im * W.im) + (d3.re * W.re) = Re part
vsub.f32 q9, q1, q14 @ q9 = (d2.re * W.re) - (d2.im * W.im) = Re part

@ Load data[1]
vld1.32 {q12, q13}, [r4, :128]

@ More butterfly operations
vsub.f32 q15, q11, q10 @ q15 = butterfly difference (Im)
vsub.f32 q14, q9, q8 @ q14 = butterfly difference (Re)
vsub.f32 q4, q12, q15 @ q4 = data[1].re - q15
vadd.f32 q6, q12, q15 @ q6 = data[1].re + q15
vadd.f32 q5, q13, q14 @ q5 = data[1].im + q14
vsub.f32 q7, q13, q14 @ q7 = data[1].im - q14
```
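Reading the comments on the first four `vsub`/`vadd` instructions, the partial products for data[2] are combined as a multiplication by W, while those for data[3] are combined as a multiplication by the conjugate of W. The scalar sketch below checks that reading against Python's built-in complex arithmetic; the function names and sample values are hypothetical.

```python
def mul_by_w(re, im, w_re, w_im):
    """Combine four real products into d * W (the data[2] pattern)."""
    out_re = re * w_re - im * w_im    # vsub q9,  q1, q14
    out_im = im * w_re + re * w_im    # vadd q11, q0, q9
    return out_re, out_im


def mul_by_conj_w(re, im, w_re, w_im):
    """Combine four real products into d * conj(W) (the data[3] pattern)."""
    out_re = im * w_im + re * w_re    # vadd q8,  q15, q13
    out_im = im * w_re - re * w_im    # vsub q10, q12, q8
    return out_re, out_im


d = complex(1.5, -2.0)
w = complex(0.6, 0.8)
print(mul_by_w(d.real, d.imag, w.real, w.imag), d * w)
print(mul_by_conj_w(d.real, d.imag, w.real, w.imag), d * w.conjugate())
```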

### Store First Results
```assembly
vst1.32 {q4, q5}, [r4, :128] @ Store to data[1]
vst1.32 {q6, q7}, [r6, :128] @ Store to data[3]
```

## Second Butterfly Stage (Lines 128-174)

### Load Second Set of Data
```assembly
vld1.32 {q14, q15}, [r9, :128] @ Load data[6]
vld1.32 {q12, q13}, [r7, :128] @ Load data[4]
vld1.32 {q2, q3}, [r12, :128]! @ Load next twiddle factors
```

### Complex Multiplication and Butterfly
Similar pattern as the first stage, but operating on data[4], data[5], data[6], and data[7].

```assembly
@ Complex multiplications
vmul.f32 q1, q14, q2 @ data[6] * W
vmul.f32 q0, q14, q3
vmul.f32 q14, q15, q3
vmul.f32 q4, q15, q2
@ ... more multiplications for data[4] and data[5]

@ Butterfly operations and combinations
vadd.f32 q14, q14, q1
vsub.f32 q13, q4, q0
@ ... more butterfly operations

@ Load data[0]
vld1.32 {q8, q9}, [r3, :128]

@ Final butterfly combinations
vadd.f32 q11, q8, q15 @ data[0] + result
vsub.f32 q8, q8, q15 @ data[0] - result
```

## Third Butterfly Stage (Lines 175-199)

### Final Stage Processing
```assembly
@ Load remaining data from updated locations
vld1.32 {q8, q9}, [r4, :128] @ Reload data[1]
vld1.32 {q10, q11}, [r6, :128] @ Reload data[3]

@ Final butterfly operations
vadd.f32 q0, q8, q13 @ Final combinations
vadd.f32 q1, q9, q12
vsub.f32 q2, q10, q15
vadd.f32 q3, q11, q14
vsub.f32 q4, q8, q13
vsub.f32 q5, q9, q12
vadd.f32 q6, q10, q15
vsub.f32 q7, q11, q14
```

### Store Final Results with Auto-increment
```assembly
vst1.32 {q0, q1}, [r3, :128]! @ Store to data[0], increment r3
vst1.32 {q2, q3}, [r5, :128]! @ Store to data[2], increment r5
vst1.32 {q4, q5}, [r7, :128]! @ Store to data[4], increment r7
vst1.32 {q6, q7}, [r9, :128]! @ Store to data[6], increment r9
vst1.32 {q0, q1}, [r4, :128]! @ Store to data[1], increment r4
vst1.32 {q2, q3}, [r6, :128]! @ Store to data[3], increment r6
vst1.32 {q4, q5}, [r8, :128]! @ Store to data[5], increment r8
vst1.32 {q6, q7}, [r10, :128]! @ Store to data[7], increment r10
```

## Loop Control
```assembly
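bne 1b @ Loop back to label 1 while Z is clear, i.e. until r11 reaches zero (the flag-setting update of r11 is not shown in these excerpts)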
bx lr @ Return
```

## Key Observations for ARM64 Porting

### 1. **Register Mapping**
- ARM32 uses r0-r12 general purpose registers
- ARM64 will use x0-x12 (64-bit) for the pointers; the w0-w12 (32-bit) views are only suitable for non-pointer values such as the loop counter
- NEON registers q0-q15 in ARM32 map to v0-v15 in ARM64

### 2. **Instruction Differences**
- `vld1.32` → `ld1` with appropriate type specifier
- `vst1.32` → `st1` with appropriate type specifier
- `vmul.f32` → `fmul`
- `vadd.f32` → `fadd`
- `vsub.f32` → `fsub`

### 3. **Addressing Modes**
- ARM32: `[r0, :128]!` (128-bit alignment hint, post-increment by the transfer size)
- ARM64: `[x0], #32` (post-increment by an explicit immediate; `ld1`/`st1` take no alignment qualifier)

### 4. **Complex Number Storage**
- Real and imaginary parts are interleaved
- Each q register holds 2 complex numbers
- Butterfly operations maintain this interleaved format

### 5. **Algorithm Structure**
The function implements a radix-8 FFT using three stages of radix-2 butterflies:
1. First stage: Process data[2,3] with data[0,1]
2. Second stage: Process data[4,5,6,7] with previous results
3. Third stage: Final combinations and output
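For orientation, the general idea of building an 8-point transform from three radix-2 stages can be sketched as below. This is the textbook decimation-in-time form with bit-reversed input, not a transcription of `neon_x8`: the data ordering, twiddle layout, and stage grouping of the assembly routine differ, and all function names here are hypothetical.

```python
import cmath


def bit_reverse3(i):
    """Reverse the three low bits of i (0..7)."""
    return ((i & 1) << 2) | (i & 2) | ((i >> 2) & 1)


def fft8_radix2(x):
    """Textbook 8-point DIT FFT built from three radix-2 butterfly stages."""
    a = [x[bit_reverse3(i)] for i in range(8)]
    half = 1
    for _stage in range(3):
        span = half * 2
        for start in range(0, 8, span):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / span)   # twiddle factor
                top = a[start + k]
                bottom = a[start + k + half] * w
                a[start + k] = top + bottom                # butterfly sum
                a[start + k + half] = top - bottom         # butterfly difference
        half = span
    return a


def dft8(x):
    """Direct O(N^2) DFT used as a reference."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / 8) for n in range(8))
            for k in range(8)]


samples = [complex(n, -0.5 * n) for n in range(8)]
print(max(abs(p - q) for p, q in zip(fft8_radix2(samples), dft8(samples))))
```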

### 6. **Memory Access Pattern**
- Strided access pattern based on input stride parameter
- Auto-increment addressing used for efficiency
- 128-bit aligned loads/stores for optimal performance

### 7. **Twiddle Factor Application**
- Twiddle factors are pre-computed and stored in LUT
- Complex multiplication implemented using 4 real multiplications
- Results combined using addition/subtraction

## Critical Path Analysis
The function has multiple data dependencies between stages, but within each stage, many operations can execute in parallel. The critical path involves:
1. Load → Complex multiply → Butterfly → Store
2. Inter-stage dependencies require careful scheduling

## Performance Considerations
- Uses 128-bit NEON registers efficiently
- Minimizes memory accesses through register reuse
- Processes multiple complex numbers simultaneously
- Further loop unrolling may help for larger transforms, subject to measurement on the target core
141 changes: 141 additions & 0 deletions neon_x8_porting_summary.md
@@ -0,0 +1,141 @@
# ARM32 NEON neon_x8 to ARM64 Porting Summary

## Executive Summary

The `neon_x8` function implements an 8-point FFT using ARM32 NEON SIMD instructions. It's a highly optimized in-place algorithm that processes complex numbers stored in an interleaved format (real, imaginary pairs). The function uses a radix-8 decomposition implemented as three stages of radix-2 butterflies.

## Key Algorithm Characteristics

### 1. **FFT Type**
- 8-point Decimation-In-Time (DIT) FFT
- In-place computation
- Complex-to-complex transform
- Three butterfly stages

### 2. **Data Layout**
- Complex numbers stored as interleaved real/imaginary pairs
- Each NEON q register holds 2 complex numbers (4 floats)
- Data accessed with stride to support larger transforms

### 3. **Computational Pattern**
```
Stage 1: Butterflies on (data[2], data[3]) with (data[0], data[1])
Stage 2: Butterflies on (data[4], data[5], data[6], data[7])
Stage 3: Final combinations of all 8 points
```

## Register Allocation Strategy

### NEON Registers (q0-q15)
- **q0-q1**: Temporary computation registers
- **q2-q3**: Twiddle factors from LUT
- **q4-q7**: Butterfly results for storage
- **q8-q15**: Data values and intermediate results

### ARM Registers (r0-r12)
- **r0-r2**: Function parameters (data, stride, LUT)
- **r3-r10**: Pointers to 8 data elements
- **r11**: Loop counter
- **r12**: Current LUT pointer

## Critical Implementation Details

### 1. **Loop Structure**
- Loop counter starts at `-(stride >> 5)` and counts up to zero, i.e. `stride / 32` iterations
- Processes multiple 8-point FFTs in sequence
- Auto-increment addressing for efficiency

### 2. **Complex Multiplication**
```
(a + bi) * (c + di) = (ac - bd) + (ad + bc)i
```
Implemented using 4 real multiplications per complex multiply

### 3. **Memory Access Pattern**
- Initial loads from all 8 locations
- Intermediate stores after each butterfly stage
- Pointer auto-increment prepares for next iteration

### 4. **Twiddle Factor Loading**
- Pre-computed twiddle factors in LUT
- Loaded sequentially with auto-increment
- Two q registers per twiddle set (4 complex values)

## ARM64 Porting Considerations

### 1. **Instruction Mapping**
| ARM32 | ARM64 |
|-------|-------|
| `vld1.32 {q0,q1}, [r0, :128]!` | `ld1 {v0.4s, v1.4s}, [x0], #32` |
| `vst1.32 {q0,q1}, [r0, :128]` | `st1 {v0.4s, v1.4s}, [x0]` |
| `vmul.f32 q0, q1, q2` | `fmul v0.4s, v1.4s, v2.4s` |
| `vadd.f32 q0, q1, q2` | `fadd v0.4s, v1.4s, v2.4s` |
| `vsub.f32 q0, q1, q2` | `fsub v0.4s, v1.4s, v2.4s` |

### 2. **Register Mapping**
| ARM32 | ARM64 |
|-------|-------|
| r0-r12 | x0-x12 (64-bit, required for pointers) or w0-w12 (32-bit, non-pointer values only) |
| q0-q15 | v0-v15 (128-bit vectors) |

### 3. **Addressing Modes**
- ARM64 supports similar aligned loads with post-increment
- Syntax: `[x0], #32` for 32-byte post-increment
- Alignment hints: AArch64 `ld1`/`st1` have no alignment qualifier, so the `:128` hint is dropped in the port

### 4. **Optimization Opportunities**
- ARM64 has 32 128-bit vector registers (v0-v31) vs 16 128-bit q registers (q0-q15) in ARM32
- Can reduce register pressure and memory accesses
- Potential for better instruction scheduling
- Consider using paired loads (ldp) where beneficial

## Performance Critical Paths

### 1. **Data Dependencies**
```
Load → Complex Multiply → Butterfly → Store
Next stage butterflies depend on previous results
```

### 2. **Latency Hiding**
- Early loads to hide memory latency
- Interleaved arithmetic operations
- Register reuse minimizes loads

### 3. **Throughput Optimization**
- Each loop iteration processes 4 complex numbers (two q registers) from each of the 8 data pointers
- Aligned 128-bit loads/stores
- Minimal memory traffic through in-place operation

## Implementation Strategy for ARM64

### 1. **Direct Translation**
- Start with 1:1 instruction mapping
- Maintain same register allocation strategy
- Preserve memory access pattern

### 2. **ARM64-Specific Optimizations**
- Utilize additional v16-v31 registers
- Consider SVE/SVE2 for scalable vectors
- Explore fused multiply-add (fmla) opportunities
- Use paired load/store where beneficial
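To make the fused multiply-add opportunity concrete, the sketch below rearranges the four-product complex multiply into steps of the form `acc = acc ± x * y`, which is the shape that AArch64 `fmla` and `fmls` compute in one instruction. It models only the data flow: Python rounds the product and the sum separately, whereas a hardware fused operation rounds once, so results on real hardware may differ in the last bit. Function names are hypothetical.

```python
def complex_mul_four_products(a, b, c, d):
    """(a + bi)(c + di) with four multiplies, then a subtract and an add."""
    ac, bd, ad, bc = a * c, b * d, a * d, b * c
    return ac - bd, ad + bc


def complex_mul_accumulate(a, b, c, d):
    """Same product arranged as multiply followed by multiply-accumulate."""
    re = a * c
    re = re - b * d      # fmls-shaped step: acc = acc - x * y
    im = a * d
    im = im + b * c      # fmla-shaped step: acc = acc + x * y
    return re, im


print(complex_mul_four_products(1.5, -2.0, 0.6, 0.8))
print(complex_mul_accumulate(1.5, -2.0, 0.6, 0.8))
```

This arrangement needs two multiplies and two accumulate steps per complex product instead of four multiplies plus two add/subtract instructions.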

### 3. **Testing Considerations**
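- Verify bit-exact results against the ARM32 version for the direct translation; if `fmla` is introduced, compare within a tolerance instead, since fused operations round differently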
- Test with various stride values
- Validate alignment requirements
- Performance comparison on target hardware
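A bit-exact comparison can be sketched as a small host-side helper that compares single-precision bit patterns rather than numeric values, so that differences such as `0.0` versus `-0.0` are caught. How the two output buffers are captured from the ARM32 and ARM64 builds is outside this sketch; `expected`, `actual`, and both function names are hypothetical.

```python
import struct


def float32_bits(value):
    """Return the IEEE 754 single-precision bit pattern of value as an int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def first_mismatch(expected, actual):
    """Index of the first element whose float32 bits differ, or None."""
    if len(expected) != len(actual):
        raise ValueError("buffers differ in length")
    for index, (e, a) in enumerate(zip(expected, actual)):
        if float32_bits(e) != float32_bits(a):
            return index
    return None


print(first_mismatch([1.0, 2.0, 0.0], [1.0, 2.0, -0.0]))
```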

## Summary of Key Findings

1. **Algorithm**: Three-stage radix-2 butterfly implementation of 8-point FFT
2. **Data Format**: Interleaved complex numbers, 2 per NEON register
3. **Memory Pattern**: In-place with specific butterfly groupings
4. **Twiddle Factors**: Pre-computed, loaded from LUT
5. **Loop Structure**: Processes multiple 8-point FFTs based on stride
6. **Critical Path**: Load → Multiply → Butterfly → Store chain
7. **Register Usage**: Near-optimal use of available NEON registers
8. **Optimization**: Careful instruction scheduling for latency hiding

This analysis provides the foundation for accurate ARM64 porting while maintaining the performance characteristics of the original ARM32 implementation.