389 changes: 389 additions & 0 deletions neon_oe_analysis.md
@@ -0,0 +1,389 @@
# Detailed Analysis of ARM32 NEON `neon_oe` Function

## Overview

The `neon_oe` function in `src/neon.s` is a highly optimized ARM32 NEON assembly implementation for performing FFT (Fast Fourier Transform) butterfly operations. The function name likely stands for "odd-even" processing, which is a common pattern in FFT algorithms.

## Function Entry Assumptions

### Register Setup (Lines 543-548)
```assembly
@ assumes r0 = out
@ r12 = offsets
@ r3-r10 = data pointers
@ r11 = addr of twiddle
@ r2 & lr = temps
```

**Input Registers:**
- `r0`: Output buffer base address
- `r12`: Pointer to offset array (used to calculate output addresses)
- `r3`: Data pointer 1 (input data)
- `r4`: Data pointer 2 (input data)
- `r5`: Data pointer 3 (input data)
- `r6`: Data pointer 4 (input data)
- `r7`: Data pointer 5 (input data)
- `r8`: Data pointer 6 (input data)
- `r9`: Data pointer 7 (input data)
- `r10`: Data pointer 8 (input data)
- `r11`: Pointer to twiddle factor table
- `r2`, `lr`: Temporary registers (will be overwritten)

## Instruction-by-Instruction Analysis

### Phase 1: Initial Data Loading (Lines 557-561)

```assembly
557: vld1.32 {q8}, [r5, :128]!
558: vld1.32 {q10}, [r6, :128]!
559: vld2.32 {q11}, [r4, :128]!
560: vld2.32 {q13}, [r3, :128]!
561: vld2.32 {q15}, [r10, :128]!
```

**Analysis:**
- `vld1.32 {q8}, [r5, :128]!`: Loads 4 consecutive 32-bit floats from address in r5 into q8 (d16-d17). The `:128` indicates 128-bit alignment. The `!` updates r5 by 16 bytes.
- `vld1.32 {q10}, [r6, :128]!`: Loads 4 consecutive floats from r6 into q10 (d20-d21).
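- `vld2.32 {q11}, [r4, :128]!`: **Deinterleaving load**. Loads 4 floats (16 bytes, i.e. 2 interleaved complex numbers) from r4, putting the even-indexed elements (0, 2) into d22 and the odd-indexed elements (1, 3) into d23. Each D register holds only two 32-bit lanes, so this separates the two real parts from the two imaginary parts. The `!` updates r4 by 16 bytes.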
- `vld2.32 {q13}, [r3, :128]!`: Deinterleaving load from r3 into q13 (d26-d27).
- `vld2.32 {q15}, [r10, :128]!`: Deinterleaving load from r10 into q15 (d30-d31).

**Data Organization After Loading:**
- q8 (d16-d17): 4 consecutive values from r5
- q10 (d20-d21): 4 consecutive values from r6
- q11: d22 = real parts, d23 = imaginary parts from r4
- q13: d26 = real parts, d27 = imaginary parts from r3
- q15: d30 = real parts, d31 = imaginary parts from r10
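
To make the difference between the two load forms concrete, here is a small Python model of a plain `vld1.32` and a two-register `vld2.32` reading the same four floats. It is a sketch of the lane placement only, not of the hardware, and the helper names `vld1_32` and `vld2_32` are hypothetical.

```python
def vld1_32(mem):
    """Model vld1.32 {qN}: four consecutive floats, memory order preserved."""
    assert len(mem) == 4
    lower_d, upper_d = mem[0:2], mem[2:4]
    return lower_d, upper_d


def vld2_32(mem):
    """Model vld2.32 {dA, dB}: de-interleave four floats into two D registers."""
    assert len(mem) == 4
    d_even = [mem[0], mem[2]]   # elements 0 and 2: real parts
    d_odd = [mem[1], mem[3]]    # elements 1 and 3: imaginary parts
    return d_even, d_odd


# Two interleaved complex numbers: (1+2j) and (3+4j)
memory = [1.0, 2.0, 3.0, 4.0]
print(vld1_32(memory))   # ([1.0, 2.0], [3.0, 4.0])
print(vld2_32(memory))   # ([1.0, 3.0], [2.0, 4.0])
```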

### Phase 2: Register Reorganization (Lines 562-564)

```assembly
562: vorr d25, d17, d17
563: vorr d24, d20, d20
564: vorr d20, d16, d16
```

**Analysis:**
These are copy operations using bitwise OR with itself:
- `d25 = d17` (copy upper half of q8)
- `d24 = d20` (copy lower half of q10)
- `d20 = d16` (copy lower half of q8)

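After these copies, q12 (d24-d25) holds the original lower half of q10 followed by the upper half of q8, while q10 (d20-d21) holds the lower half of q8 followed by its own original upper half. The order matters: d24 must be copied from d20 before d20 is overwritten on line 564.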
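
The effect of the three copies, and why their order matters, can be traced with a small Python model. This is only an illustrative sketch: the dictionary `d` and the helper `vorr` are hypothetical stand-ins for the D registers, and the string labels mark which input element each lane came from.

```python
# D registers modeled as 2-element lists of labels.
d = {
    16: ["r5[0]", "r5[1]"],   # lower half of q8
    17: ["r5[2]", "r5[3]"],   # upper half of q8
    20: ["r6[0]", "r6[1]"],   # lower half of q10
    21: ["r6[2]", "r6[3]"],   # upper half of q10
}


def vorr(regs, dst, src):
    """Model `vorr dDst, dSrc, dSrc`, which is a plain register copy."""
    regs[dst] = list(regs[src])


vorr(d, 25, 17)   # d25 = upper half of q8
vorr(d, 24, 20)   # d24 = lower half of q10 (before d20 is overwritten)
vorr(d, 20, 16)   # d20 = lower half of q8

q10 = d[20] + d[21]
q12 = d[24] + d[25]
print(q10)   # ['r5[0]', 'r5[1]', 'r6[2]', 'r6[3]']
print(q12)   # ['r6[0]', 'r6[1]', 'r5[2]', 'r5[3]']
```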

### Phase 3: First Butterfly Computations (Lines 565-566)

```assembly
565: vsub.f32 q9, q13, q11
566: vadd.f32 q11, q13, q11
```

**Butterfly operation on complex data:**
- `q9 = q13 - q11`: Difference (complex subtraction)
- `q11 = q13 + q11`: Sum (complex addition)

Since q11 and q13 were loaded with vld2, their real and imaginary parts are already separated:
- d18 = real(q13) - real(q11)
- d19 = imag(q13) - imag(q11)
- d22 = real(q13) + real(q11)
- d23 = imag(q13) + imag(q11)
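
Because the real and imaginary lanes are already separated, the complex butterfly reduces to independent lane-wise additions and subtractions. A minimal Python sketch of that idea follows; the function name `butterfly` and the sample values are illustrative assumptions, not taken from the source.

```python
def butterfly(a_re, a_im, b_re, b_im):
    """Return (a + b, a - b), lane by lane, on split real/imaginary vectors."""
    sum_re = [x + y for x, y in zip(a_re, b_re)]
    sum_im = [x + y for x, y in zip(a_im, b_im)]
    diff_re = [x - y for x, y in zip(a_re, b_re)]
    diff_im = [x - y for x, y in zip(a_im, b_im)]
    return (sum_re, sum_im), (diff_re, diff_im)


# q13 plays the role of a = [1+2j, 3+4j]; q11 plays the role of b = [5+6j, 7+8j]
total, diff = butterfly([1.0, 3.0], [2.0, 4.0], [5.0, 7.0], [6.0, 8.0])
print(total)   # ([6.0, 10.0], [8.0, 12.0])
print(diff)    # ([-4.0, -4.0], [-4.0, -4.0])
```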

### Phase 4: Output Address Calculation (Lines 567-573)

```assembly
567: ldr r2, [r12], #4
568: vtrn.32 d24, d25
569: ldr lr, [r12], #4
570: vtrn.32 d20, d21
571: add r2, r0, r2, lsl #2
572: vsub.f32 q8, q10, q12
573: add lr, r0, lr, lsl #2
```

**Analysis:**
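- `ldr r2, [r12], #4` and `ldr lr, [r12], #4` load two consecutive 32-bit offsets from the array at r12, advancing r12 by 4 bytes each time.
- `add r2, r0, r2, lsl #2` (and the matching `add` for lr) forms each output address as `r0 + offset * 4`, so the offsets count 4-byte elements rather than bytes.
- `vtrn.32 dA, dB` treats the two D registers as a 2×2 matrix and transposes it, i.e. it swaps `dA[1]` with `dB[0]`. Applied to d24/d25 and d20/d21, it turns the interleaved `[re, im]` pairs loaded with `vld1` into the same split layout that `vld2` produces (real parts in the even register, imaginary parts in the odd one).
- `vsub.f32 q8, q10, q12` starts the second butterfly: `q8 = q10 - q12`.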
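
The two non-arithmetic steps of this phase can be modeled in a few lines of Python. This is a sketch for illustration only: `output_address` and `vtrn32_d` are hypothetical helper names, and the base address and offset are made-up values.

```python
def output_address(base, offset):
    """Model `add rX, r0, rX, lsl #2`: base plus offset scaled by 4 bytes."""
    return base + (offset << 2)


def vtrn32_d(da, db):
    """Model `vtrn.32 dA, dB`: swap dA[1] with dB[0] (a 2x2 transpose)."""
    return [da[0], db[0]], [da[1], db[1]]


print(hex(output_address(0x1000, 6)))   # 0x1018

# Two interleaved [re, im] pairs become one register of real parts
# and one register of imaginary parts.
print(vtrn32_d(["re0", "im0"], ["re1", "im1"]))
```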

### Phase 5: Complete First Set of Butterflies (Lines 574-580)

```assembly
574: vadd.f32 q10, q10, q12
575: vadd.f32 q0, q11, q10
576: vadd.f32 d25, d19, d16
577: vsub.f32 d27, d19, d16
578: vsub.f32 q1, q11, q10
579: vsub.f32 d24, d18, d17
580: vadd.f32 d26, d18, d17
```

Complex butterfly results:
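- `q10 = q10 + q12`: the sum half of the butterfly whose difference was computed on line 572
- `q0 = q11 + q10`: Second stage butterfly sum
- `q1 = q11 - q10`: Second stage butterfly difference
- The four D-register operations cross the real and imaginary halves: `d24 = d18 - d17` and `d25 = d19 + d16` form `q9 + i·q8` in q12, while `d26 = d18 + d17` and `d27 = d19 - d16` form `q9 - i·q8` in q13. Assuming the even D register of each pair holds real parts and the odd one imaginary parts, this applies the trivial twiddle factor ±i without any multiplication.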
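
Multiplying by `i` only swaps the real and imaginary parts and flips one sign, which is why these lines need additions and subtractions but no multiplies. The following Python sketch checks that reading against Python's built-in complex type. It assumes d18/d19 hold the real/imaginary parts of one operand and d16/d17 those of the other; the helper names are hypothetical.

```python
def add_i_times(x_re, x_im, y_re, y_im):
    """x + i*y on split lanes, using i*(yr + i*yi) = -yi + i*yr."""
    re = [a - b for a, b in zip(x_re, y_im)]   # like vsub.f32 d24, d18, d17
    im = [a + b for a, b in zip(x_im, y_re)]   # like vadd.f32 d25, d19, d16
    return re, im


def sub_i_times(x_re, x_im, y_re, y_im):
    """x - i*y on split lanes."""
    re = [a + b for a, b in zip(x_re, y_im)]   # like vadd.f32 d26, d18, d17
    im = [a - b for a, b in zip(x_im, y_re)]   # like vsub.f32 d27, d19, d16
    return re, im


x, y = 1 + 2j, 3 + 4j
plus = add_i_times([x.real], [x.imag], [y.real], [y.imag])
minus = sub_i_times([x.real], [x.imag], [y.real], [y.imag])
print(plus, x + 1j * y)    # both represent -3+5j
print(minus, x - 1j * y)   # both represent 5-1j
```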

### Phase 6: Data Transposition and First Store (Lines 581-585)

```assembly
581: vtrn.32 q0, q12
582: vtrn.32 q1, q13
583: vld1.32 {d24, d25}, [r11, :128]
584: vswp d1, d2
585: vst1.32 {q0, q1}, [r2, :128]!
```

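- `vtrn.32 q0, q12` and `vtrn.32 q1, q13` swap lanes between each register pair (`qA[1]` with `qB[0]`, `qA[3]` with `qB[2]`), so q0 and q1 end up holding lanes 0 and 2 of both operands
- `vld1.32 {d24, d25}, [r11, :128]` loads the twiddle factors into q12, reusing the register as soon as the transposed lanes it held are no longer needed; there is no `!`, so r11 is not advanced
- `vswp d1, d2` exchanges the upper half of q0 with the lower half of q1 to put the lanes into output order
- `vst1.32 {q0, q1}, [r2, :128]!` stores 32 bytes to the first output address and advances r2 by 32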
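
The lane movement performed by the two `vtrn.32` instructions and the `vswp` can be followed with a Python model that tracks labels instead of values. It is a sketch of the shuffle only: the helper names are hypothetical, and the labels `a`, `b`, `c`, `d` simply stand for the original contents of q0, q12, q1 and q13.

```python
def vtrn32_q(qa, qb):
    """Model `vtrn.32 qA, qB`: swap qA[1] with qB[0] and qA[3] with qB[2]."""
    new_a = [qa[0], qb[0], qa[2], qb[2]]
    new_b = [qa[1], qb[1], qa[3], qb[3]]
    return new_a, new_b


def vswp_d1_d2(q0, q1):
    """Model `vswp d1, d2`: exchange upper half of q0 with lower half of q1."""
    return q0[:2] + q1[:2], q0[2:] + q1[2:]


q0 = ["a0", "a1", "a2", "a3"]
q12 = ["b0", "b1", "b2", "b3"]
q1 = ["c0", "c1", "c2", "c3"]
q13 = ["d0", "d1", "d2", "d3"]

q0, q12 = vtrn32_q(q0, q12)
q1, q13 = vtrn32_q(q1, q13)
q0, q1 = vswp_d1_d2(q0, q1)

# Order in which the 8 floats reach memory via vst1.32 {q0, q1}
print(q0 + q1)   # ['a0', 'b0', 'c0', 'd0', 'a2', 'b2', 'c2', 'd2']
```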

### Phase 7: Second Set of Data Loading (Lines 586-589)

```assembly
586: vld2.32 {q0}, [r9, :128]!
587: vadd.f32 q1, q0, q15
588: vld2.32 {q13}, [r8, :128]!
589: vld2.32 {q14}, [r7, :128]!
```

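Three more deinterleaving loads fetch the remaining inputs from r9, r8 and r7 (16 bytes each). The `vadd.f32` on line 587 is scheduled between them: it starts the next butterfly by adding the r9 data (q0) to the r10 data that was loaded into q15 during Phase 1.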

### Phase 8: Second Set of Butterflies (Lines 590-602)

```assembly
590: vsub.f32 q15, q0, q15
591: vsub.f32 q0, q14, q13
592: vadd.f32 q3, q14, q13
593: vadd.f32 q2, q3, q1
594: vadd.f32 d29, d1, d30
595: vsub.f32 d27, d1, d30
596: vsub.f32 q3, q3, q1
597: vsub.f32 d28, d0, d31
598: vadd.f32 d26, d0, d31
599: vtrn.32 q2, q14
600: vtrn.32 q3, q13
601: vswp d5, d6
602: vst1.32 {q2, q3}, [r2, :128]!
```

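This repeats the pattern of Phases 3-6 on the data from r7-r10: first- and second-stage sums and differences (lines 590-593 and 596), D-register cross terms that combine real and imaginary halves (lines 594-595 and 597-598), then `vtrn.32`/`vswp` reordering and a second 32-byte store through r2.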

### Phase 9: Twiddle Factor Multiplication (Lines 603-616)

```assembly
603: vtrn.32 q11, q9
604: vtrn.32 q10, q8
605: vmul.f32 d20, d18, d25
606: vmul.f32 d22, d19, d24
607: vmul.f32 d21, d19, d25
608: vmul.f32 d18, d18, d24
609: vmul.f32 d19, d16, d25
610: vmul.f32 d30, d17, d24
611: vmul.f32 d23, d16, d24
612: vmul.f32 d24, d17, d25
613: vadd.f32 d17, d22, d20
614: vsub.f32 d16, d18, d21
615: vsub.f32 d21, d30, d19
616: vadd.f32 d20, d24, d23
```

**Complex multiplication with twiddle factors:**
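- This implements complex multiplication: (a + bi) × (c + di) = (ac - bd) + (ad + bc)i
- The twiddle factors in q12 are used with d24 as the real part (c) and d25 as the imaginary part (d)
- Lines 605-608 form the four partial products for q9 (a = d18, b = d19), and lines 613-614 combine them as `d16 = ac - bd` and `d17 = bc + ad`, i.e. a multiplication by the twiddle factor
- Lines 609-612 form the partial products for q8 (a = d16, b = d17), and lines 615-616 combine them with the opposite signs, `d20 = ac + bd` and `d21 = bc - ad`, i.e. a multiplication by the complex conjugate of the twiddle factor
- The scheduling relies on read-before-write ordering: d16/d17 are read on lines 609-612 before lines 613-614 overwrite them, and d24 is overwritten on line 612 only after its last use as a twiddle value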
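
The sign patterns in the listing can be checked against ordinary complex arithmetic. The Python sketch below mirrors the two add/subtract combinations found in lines 605-616, assuming the first register of each pair carries real parts and the second imaginary parts. The function names are hypothetical.

```python
def cmul_split(x_re, x_im, w_re, w_im):
    """x * w on split lanes (pattern of lines 605-608 and 613-614)."""
    lanes = list(zip(x_re, x_im, w_re, w_im))
    re = [a * c - b * d for a, b, c, d in lanes]
    im = [b * c + a * d for a, b, c, d in lanes]
    return re, im


def cmul_conj_split(x_re, x_im, w_re, w_im):
    """x * conj(w) on split lanes (pattern of lines 609-612 and 615-616)."""
    lanes = list(zip(x_re, x_im, w_re, w_im))
    re = [a * c + b * d for a, b, c, d in lanes]
    im = [b * c - a * d for a, b, c, d in lanes]
    return re, im


x, w = 1 + 2j, 3 + 4j
print(cmul_split([x.real], [x.imag], [w.real], [w.imag]))        # ([-5.0], [10.0])
print(cmul_conj_split([x.real], [x.imag], [w.real], [w.imag]))   # ([11.0], [2.0])
```

Comparing with `x * w` and `x * w.conjugate()` in Python gives the same values, which supports reading the second multiplication as a conjugate-twiddle product.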

### Phase 10: Final Butterflies and Storage (Lines 617-627)

```assembly
617: vadd.f32 q9, q8, q10
618: vsub.f32 q8, q8, q10
619: vadd.f32 q4, q14, q9
620: vsub.f32 q6, q14, q9
621: vadd.f32 d11, d27, d16
622: vsub.f32 d15, d27, d16
623: vsub.f32 d10, d26, d17
624: vadd.f32 d14, d26, d17
625: vswp d9, d10
626: vswp d13, d14
627: vstmia lr!, {q4-q7}
```

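The twiddled results are combined (lines 617-618) and merged with q14 and q13 from Phase 8 (lines 619-624). Two `vswp` instructions then reorder the halves of q4-q7, and `vstmia lr!, {q4-q7}` writes all four registers (64 bytes) to the second output address, advancing lr.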

## Detailed Complex Arithmetic Analysis

### Complex Number Representation
- **Memory Layout**: Complex numbers are stored as interleaved pairs [Re₀, Im₀, Re₁, Im₁, ...]
- **Register Layout After vld2** (each D register holds two 32-bit lanes):
  - Even register (e.g., d22): [Re₀, Re₁]
  - Odd register (e.g., d23): [Im₀, Im₁]

### Butterfly Operation Pattern
The function implements a radix-8 FFT butterfly over the eight input streams r3-r10, taking two complex points from each pointer per call:

1. **First Layer Butterflies** (Lines 565-566, 572-574):
   - Input pairs: (q13, q11) and (q10, q12)
   - Outputs: Sum and difference of each pair

2. **Second Layer Butterflies** (Lines 575-580):
   - Combines results from first layer
   - Creates 4-point butterfly structure

3. **Twiddle Factor Application** (Lines 605-616):
   - Complex multiplication with twiddle factors
   - Implements: Output = Butterfly_Result × W^k

### Data Flow Diagram
```
Input (8 complex points from r3-r10):
[r3] ─┐
      ├─ Butterfly ─┐
[r4] ─┘             │
                    ├─ Butterfly ─┐
[r5] ─┐             │             │
      ├─ Butterfly ─┘             ├─ Twiddle ─→ Output
[r6] ─┘                           │
[r7] ─┐                           │
      ├─ Butterfly ─┐             │
[r8] ─┘             │             │
                    ├─ Butterfly ─┘
[r9] ─┐             │
      ├─ Butterfly ─┘
[r10]─┘
```

## Memory Access Pattern

### Input Pointers (r3-r10)
- Each pointer accesses 16 bytes (2 complex numbers in interleaved single-precision format)
- Post-increment updates move each pointer forward by 16 bytes
- Total input: 8 × 2 = 16 complex numbers (128 bytes) per call

### Output Storage
- Two output locations calculated from offsets in r12
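- First output: 64 bytes stored at address r2 (two 32-byte `vst1.32` stores with post-increment, lines 585 and 602)
- Second output: 64 bytes stored at address lr (one `vstmia` of q4-q7, line 627)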
- Output format maintains interleaved real/imaginary layout
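
A quick byte count taken directly from the load and store instructions in the listing shows that input and output sizes balance. This is plain arithmetic written as a Python sketch; the constant names are illustrative.

```python
FLOAT_BYTES = 4
COMPLEX_BYTES = 2 * FLOAT_BYTES   # interleaved re, im in single precision
Q_BYTES = 16                      # one 128-bit Q register

# Eight post-incremented loads (r3-r10), one Q register's worth each.
input_bytes = 8 * Q_BYTES

# Two `vst1.32 {qA, qB}` stores through r2, one `vstmia lr!, {q4-q7}`.
r2_bytes = 2 * (2 * Q_BYTES)
lr_bytes = 4 * Q_BYTES

print(input_bytes, input_bytes // COMPLEX_BYTES)                     # 128 16
print(r2_bytes + lr_bytes, (r2_bytes + lr_bytes) // COMPLEX_BYTES)   # 128 16
```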

### Twiddle Factors (r11)
- Loaded once per function call
- Contains pre-computed sine/cosine values for FFT
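- Format: cosine (real) and sine (imaginary) values for the complex multiplication. Because the table is read with a plain `vld1.32` and the multiply sequence uses d24 as the real part and d25 as the imaginary part, the 16 bytes at r11 appear to hold two real parts followed by two imaginary parts rather than interleaved [cos, sin] pairs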
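
As an illustration of what such a pre-computed table could look like, the Python sketch below builds four floats for two twiddle factors. Everything specific here is an assumption made for the example and is not taken from the source: the forward-transform sign convention `W_n^k = exp(-2πik/n)`, the indices `k = 1, 3`, the size `n = 8`, and the split layout (real parts first, then imaginary parts).

```python
import math


def twiddle(k, n):
    """Return (cos, sin) of W_n^k = exp(-2*pi*i*k/n); sign convention assumed."""
    angle = -2.0 * math.pi * k / n
    return math.cos(angle), math.sin(angle)


def split_table(ks, n):
    """Pack twiddles as [re..., im...] (assumed layout, not from the source)."""
    pairs = [twiddle(k, n) for k in ks]
    return [re for re, _ in pairs] + [im for _, im in pairs]


table = split_table([1, 3], 8)   # hypothetical indices and transform size
low_d, high_d = table[:2], table[2:]
print(low_d)    # real parts, as a plain 128-bit load would place in the low D register
print(high_d)   # imaginary parts, in the high D register
```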

## Performance Characteristics

### Instruction-Level Parallelism
- Multiple independent operations scheduled together
- Address calculations interleaved with SIMD operations
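- Independent instructions are placed between a result's producer and its consumer, which helps hide NEON latency and keeps the pipeline busy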

### Memory Access Optimization
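- `:128` qualifiers on loads and stores assert 16-byte-aligned addresses, which permits the faster aligned access path; a misaligned address raises an alignment fault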
- Deinterleaving loads (vld2) save separate deinterleaving instructions
- Post-increment addressing reduces instruction count

### Register Pressure Management
- All 16 NEON registers (q0-q15) are utilized
- No NEON register is spilled to the stack inside the routine
- Temporary values reuse registers as soon as possible

## Register Usage Summary

### NEON Registers Modified:
- q0-q15: All used for computation and temporary storage
- Specifically:
- q8-q10: Initial data and intermediate results
- q11, q13, q15: Complex data (real/imag separated)
- q12: Twiddle factors
- q0-q7: Final results

### ARM Registers Modified:
- r2: Temporary, used for output address calculation
- lr: Temporary, used for output address calculation
- r3-r10: Updated by post-increment addressing (moved forward by 16 bytes)
- r12: Updated by post-increment (moved forward by 8 bytes)

### Preserved Registers:
- r0: Output base address (unchanged)
- r11: Twiddle factor pointer (unchanged)

## Data Flow Analysis

1. **Input Format**: Complex numbers stored as interleaved real/imaginary pairs
2. **Processing**:
   - Deinterleaving loads separate real/imaginary
   - Butterfly operations on separated components
   - Twiddle factor multiplication
   - Re-interleaving for output
3. **Output Format**: Processed complex numbers in interleaved format

## Key Optimizations

1. **Deinterleaving Loads**: vld2 instructions automatically separate real/imaginary parts
2. **Parallel Processing**: Each Q-register operation works on four float lanes, i.e. two complex numbers, at once
3. **Efficient Address Updates**: Post-increment addressing modes
4. **Register Reuse**: Careful management to minimize register pressure
5. **Aligned Access**: :128 alignment hints for optimal memory access

## Considerations for ARM64 Porting

### Register Mapping
```
ARM32 → ARM64
q0-q15 → v0-v31 (doubled register count)
r0-r12,lr → x0-x30 (64-bit registers)
d(2n), d(2n+1) → vn.d[0], vn.d[1] (on ARM64, dN names only the low half of vN)
```

### Instruction Translation
```
ARM32 → ARM64
vld1.32 {q8}, [r5]! → ld1 {v8.4s}, [x5], #16
vld2.32 {q11}, [r4]! → ld2 {v22.2s, v23.2s}, [x4], #16
vadd.f32 q0, q1, q2 → fadd v0.4s, v1.4s, v2.4s
vsub.f32 q0, q1, q2 → fsub v0.4s, v1.4s, v2.4s
vmul.f32 d0, d1, d2 → fmul v0.2s, v1.2s, v2.2s
vtrn.32 q0, q1 → trn1 + trn2 (two instructions, separate destination registers)
vstmia r0!, {q0-q3} → st1 {v0.4s-v3.4s}, [x0], #64
```
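
Unlike `vtrn.32`, which rewrites both of its operands in place, ARM64 splits the transpose into `trn1` and `trn2`, each writing one destination. The Python sketch below models the lane selection of all three to show that one `vtrn.32` corresponds to a `trn1`/`trn2` pair. The helper names are hypothetical and the lists stand in for 4-lane vectors.

```python
def vtrn32_q(qd, qm):
    """ARM32 `vtrn.32 qD, qM`: returns the new (qD, qM) after the in-place swap."""
    return [qd[0], qm[0], qd[2], qm[2]], [qd[1], qm[1], qd[3], qm[3]]


def trn1_4s(vn, vm):
    """ARM64 `trn1 vD.4s, vN.4s, vM.4s`: even lanes of vN and vM, interleaved."""
    return [vn[0], vm[0], vn[2], vm[2]]


def trn2_4s(vn, vm):
    """ARM64 `trn2 vD.4s, vN.4s, vM.4s`: odd lanes of vN and vM, interleaved."""
    return [vn[1], vm[1], vn[3], vm[3]]


a = [0, 1, 2, 3]
b = [10, 11, 12, 13]
print(vtrn32_q(a, b))                  # ([0, 10, 2, 12], [1, 11, 3, 13])
print(trn1_4s(a, b), trn2_4s(a, b))    # [0, 10, 2, 12] [1, 11, 3, 13]
```

Because `trn1` and `trn2` both read the original operands, a port needs a spare destination register (or a temporary copy) where the ARM32 code relied on the in-place update.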

### Optimization Opportunities for ARM64

1. **Additional Registers**: Use v16-v31 to reduce dependencies
2. **Larger Vectors**: Consider SVE where available; its vector length is implementation-defined (128 to 2048 bits), so code must not assume a fixed 256-bit width
3. **Fused Operations**: Use fmla/fmls for multiply-accumulate
4. **Better Scheduling**: More registers allow better instruction scheduling
5. **Predication**: SVE provides predicated operations for conditional execution

### Potential ARM64 Implementation Structure
```assembly
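// Example ARM64 equivalent structure (illustrative sketch, not a drop-in port).
// Registers v0-v7 and v16-v19 are used so no callee-saved v8-v15 is clobbered.
// Assumes the twiddle table at x11 stores real parts, then imaginary parts.
    .text
    .global neon_oe_arm64
neon_oe_arm64:
    // Load with deinterleaving: 4 complex floats per pointer
    ld2     {v0.4s, v1.4s}, [x4], #32       // v0 = re, v1 = im
    ld2     {v2.4s, v3.4s}, [x3], #32       // v2 = re, v3 = im
    ld1     {v16.4s, v17.4s}, [x11]         // v16 = w.re, v17 = w.im

    // Butterfly operations
    fadd    v4.4s, v2.4s, v0.4s             // sum.re
    fadd    v5.4s, v3.4s, v1.4s             // sum.im
    fsub    v6.4s, v2.4s, v0.4s             // diff.re
    fsub    v7.4s, v3.4s, v1.4s             // diff.im

    // Twiddle multiplication (diff * w) using fmla/fmls
    fmul    v18.4s, v6.4s, v16.4s           // re*w.re
    fmls    v18.4s, v7.4s, v17.4s           //   - im*w.im
    fmul    v19.4s, v6.4s, v17.4s           // re*w.im
    fmla    v19.4s, v7.4s, v16.4s           //   + im*w.re

    // Store with interleaving
    st2     {v4.4s, v5.4s}, [x2], #32
    st2     {v18.4s, v19.4s}, [x2], #32
    ret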
```

## Summary

The `neon_oe` function is a highly optimized radix-8 FFT butterfly implementation that:
- Processes 16 complex numbers (8 groups of 2) per invocation
- Uses all available NEON registers efficiently
- Implements complex arithmetic using deinterleaved data layout
- Applies twiddle factors for FFT frequency domain transformation
- Maintains excellent memory access patterns with aligned loads/stores

The function demonstrates expert-level ARM32 NEON optimization techniques that should be carefully preserved when porting to ARM64, while taking advantage of the newer architecture's enhanced capabilities.