simd


A high-performance SIMD (Single Instruction, Multiple Data) library for Go providing vectorized operations on float64, float32, float16, complex128, and complex64 slices.

Features

  • Pure Go assembly - Native Go assembler, simple cross-compilation
  • Runtime CPU detection - Automatically selects optimal implementation (AVX-512, AVX+FMA, SSE4.1, NEON, NEON+FP16, or pure Go)
  • Zero allocations - All operations work on pre-allocated slices
  • 80+ operations - Arithmetic, reduction, statistical, vector, signal-processing, activation, and complex-number operations
  • Multi-architecture - AMD64 (AVX-512/AVX+FMA/SSE4.1) and ARM64 (NEON/NEON+FP16) with pure Go fallback
  • Half-precision support - Native FP16 SIMD on ARM64 with FP16 extension (Apple Silicon, Cortex-A55+)
  • Thread-safe - All functions are safe for concurrent use

Installation

go get github.com/tphakala/simd

Requires Go 1.25+

Quick Start

package main

import (
    "fmt"
    "github.com/tphakala/simd/cpu"
    "github.com/tphakala/simd/f64"
)

func main() {
    fmt.Println("SIMD:", cpu.Info())

    // Input vectors
    a := []float64{1, 2, 3, 4, 5, 6, 7, 8}
    b := []float64{8, 7, 6, 5, 4, 3, 2, 1}

    // Dot product
    dot := f64.DotProduct(a, b)
    fmt.Println("Dot product:", dot) // 120

    // Element-wise operations
    dst := make([]float64, len(a))
    f64.Add(dst, a, b)
    fmt.Println("Sum:", dst) // [9, 9, 9, 9, 9, 9, 9, 9]

    // Statistical operations
    mean := f64.Mean(a)
    stddev := f64.StdDev(a)
    fmt.Printf("Mean: %.2f, StdDev: %.2f\n", mean, stddev)

    // Vector operations
    f64.Normalize(dst, a)
    fmt.Println("Normalized:", dst)

    // Distance calculation
    dist := f64.EuclideanDistance(a, b)
    fmt.Println("Distance:", dist)
}
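
All functions are safe for concurrent use and allocate nothing, so large workloads can be split across goroutines as long as each one writes a disjoint region of the destination; a minimal sketch:

package main

import (
    "fmt"
    "sync"

    "github.com/tphakala/simd/f64"
)

func main() {
    const n, workers = 1 << 20, 4
    a, b, dst := make([]float64, n), make([]float64, n), make([]float64, n)

    var wg sync.WaitGroup
    chunk := n / workers
    for w := 0; w < workers; w++ {
        lo, hi := w*chunk, (w+1)*chunk
        if w == workers-1 {
            hi = n // last worker takes any remainder
        }
        wg.Add(1)
        go func(lo, hi int) {
            defer wg.Done()
            // Each goroutine writes a disjoint region of dst, so no locking is needed.
            f64.Add(dst[lo:hi], a[lo:hi], b[lo:hi])
        }(lo, hi)
    }
    wg.Wait()
    fmt.Println(dst[0])
}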

Packages

cpu - CPU Feature Detection

import "github.com/tphakala/simd/cpu"

fmt.Println(cpu.Info())      // "AMD64 AVX-512", "AMD64 AVX+FMA", "AMD64 SSE2", or "ARM64 NEON"
fmt.Println(cpu.HasAVX())    // true/false
fmt.Println(cpu.HasAVX512()) // true/false
fmt.Println(cpu.HasNEON())   // true/false
fmt.Println(cpu.HasFP16())   // true/false (ARM64 half-precision SIMD)

f64 - float64 Operations

| Category    | Function                          | Description                    | SIMD Width |
|-------------|-----------------------------------|--------------------------------|------------|
| Arithmetic  | Add(dst, a, b)                    | Element-wise addition          | 8x (AVX-512) / 4x (AVX) / 2x (NEON) |
|             | Sub(dst, a, b)                    | Element-wise subtraction       | 8x / 4x / 2x |
|             | Mul(dst, a, b)                    | Element-wise multiplication    | 8x / 4x / 2x |
|             | Div(dst, a, b)                    | Element-wise division          | 8x / 4x / 2x |
|             | Scale(dst, a, s)                  | Multiply by scalar             | 8x / 4x / 2x |
|             | AddScalar(dst, a, s)              | Add scalar                     | 8x / 4x / 2x |
|             | FMA(dst, a, b, c)                 | Fused multiply-add: a*b+c      | 8x / 4x / 2x |
|             | AddScaled(dst, alpha, s)          | dst += alpha*s (axpy)          | 8x / 4x / 2x |
| Unary       | Abs(dst, a)                       | Absolute value                 | 8x / 4x / 2x |
|             | Neg(dst, a)                       | Negation                       | 8x / 4x / 2x |
|             | Sqrt(dst, a)                      | Square root                    | 8x / 4x / 2x |
|             | Reciprocal(dst, a)                | Reciprocal (1/x)               | 8x / 4x / 2x |
| Reduction   | DotProduct(a, b)                  | Dot product                    | 8x / 4x / 2x |
|             | Sum(a)                            | Sum of elements                | 8x / 4x / 2x |
|             | Min(a)                            | Minimum value                  | 8x / 4x / 2x |
|             | Max(a)                            | Maximum value                  | 8x / 4x / 2x |
|             | MinIdx(a)                         | Index of minimum value         | Pure Go |
|             | MaxIdx(a)                         | Index of maximum value         | Pure Go |
| Statistical | Mean(a)                           | Arithmetic mean                | 8x / 4x / 2x |
|             | Variance(a)                       | Population variance            | 8x / 4x / 2x |
|             | StdDev(a)                         | Standard deviation             | 8x / 4x / 2x |
| Vector      | EuclideanDistance(a, b)           | L2 distance                    | 8x / 4x / 2x |
|             | Normalize(dst, a)                 | Unit vector normalization      | 8x / 4x / 2x |
|             | CumulativeSum(dst, a)             | Running sum                    | Sequential |
| Range       | Clamp(dst, a, min, max)           | Clamp to range                 | 8x / 4x / 2x |
| Activation  | Sigmoid(dst, src)                 | Sigmoid: 1/(1+e^-x)            | 4x (AVX) / 2x (NEON) |
|             | ReLU(dst, src)                    | Rectified Linear Unit          | 8x / 4x / 2x |
|             | Tanh(dst, src)                    | Hyperbolic tangent             | 8x / 4x / 2x |
|             | Exp(dst, src)                     | Exponential e^x                | Pure Go |
|             | ClampScale(dst, src, min, max, s) | Fused clamp and scale          | 8x / 4x / 2x |
| Batch       | DotProductBatch(r, rows, v)       | Multiple dot products          | 8x / 4x / 2x |
| Signal      | ConvolveValid(dst, sig, k)        | FIR filter / convolution       | 8x / 4x / 2x |
|             | ConvolveValidMulti(dsts, sig, ks) | Multi-kernel convolution       | 8x / 4x / 2x |
|             | AccumulateAdd(dst, src, off)      | Overlap-add: dst[off:] += src  | 8x / 4x / 2x |
| Audio       | Interleave2(dst, a, b)            | Pack stereo: [L,R,L,R,...]     | 4x / 2x |
|             | Deinterleave2(a, b, src)          | Unpack stereo to channels      | 4x / 2x |
|             | CubicInterpDot(hist,a,b,c,d,x)    | Fused cubic interp dot product | 4x / 2x |
|             | Int32ToFloat32Scale(dst,src,s)    | PCM int32 to normalized float  | 8x / 4x |
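
The Signal functions compose into a simple FIR pipeline. A sketch using the signatures above, assuming the usual "valid" output length of len(sig)-len(k)+1; the kernel values are illustrative:

package main

import (
    "fmt"

    "github.com/tphakala/simd/f64"
)

func main() {
    // Constant input signal, so the moving average is exactly 1.0
    sig := make([]float64, 4096)
    for i := range sig {
        sig[i] = 1
    }

    // 4-tap moving-average kernel (illustrative values)
    k := []float64{0.25, 0.25, 0.25, 0.25}

    // "Valid" convolution: output covers positions where the kernel fully overlaps
    out := make([]float64, len(sig)-len(k)+1)
    f64.ConvolveValid(out, sig, k)
    fmt.Println(out[0]) // 1.0

    // Overlap-add a processed block into a longer accumulator at an offset
    acc := make([]float64, 8192)
    f64.AccumulateAdd(acc, out, 512) // acc[512:512+len(out)] += out
}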

f32 - float32 Operations

Same API as f64 but for float32 with wider SIMD:

| Architecture   | SIMD Width  |
|----------------|-------------|
| AMD64 (AVX-512) | 16x float32 |
| AMD64 (AVX+FMA) | 8x float32  |
| AMD64 (SSE2)    | 4x float32  |
| ARM64 (NEON)    | 4x float32  |

Additional split-format complex operations (for FFT pipelines with separate real/imag arrays):

| Category | Function                                     | Description                  | SIMD Width |
|----------|----------------------------------------------|------------------------------|------------|
| Complex  | MulComplex(dstRe,dstIm,aRe,aIm,bRe,bIm)      | Split-format complex multiply | 8x (AVX) / 4x (NEON) |
|          | MulConjComplex(dstRe,dstIm,aRe,aIm,bRe,bIm)  | Multiply by conjugate        | 8x / 4x |
|          | AbsSqComplex(dst,aRe,aIm)                    | Magnitude squared            | 8x / 4x |
|          | ButterflyComplex(uRe,uIm,lRe,lIm,twRe,twIm)  | FFT butterfly with twiddle   | 8x / 4x |
|          | RealFFTUnpack(outRe,outIm,zRe,zIm,twRe,twIm) | Real FFT unpack step         | 8x / 4x |
| Utility  | Reverse(dst, src)                            | Reverse slice order          | 8x / 4x |
|          | AddSub(sum, diff, a, b)                      | Fused sum and difference     | 8x / 4x |
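
The split format keeps real and imaginary parts in separate slices, avoiding the shuffles that interleaved complex64 data needs. A minimal sketch of split-format spectral math, assuming spectra already produced in split form:

package main

import (
    "fmt"

    "github.com/tphakala/simd/f32"
)

func main() {
    n := 8
    // Split-format spectra: real and imaginary parts in separate slices
    aRe, aIm := make([]float32, n), make([]float32, n)
    bRe, bIm := make([]float32, n), make([]float32, n)
    dstRe, dstIm := make([]float32, n), make([]float32, n)
    aRe[0], aIm[0], bRe[0], bIm[0] = 1, 2, 3, 4

    // Frequency-domain multiply: (1+2i)(3+4i) = -5+10i
    f32.MulComplex(dstRe, dstIm, aRe, aIm, bRe, bIm)
    fmt.Println(dstRe[0], dstIm[0]) // -5 10

    // Power spectrum without the sqrt: |a|^2 = re^2 + im^2
    power := make([]float32, n)
    f32.AbsSqComplex(power, aRe, aIm)
    fmt.Println(power[0]) // 5
}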

f16 - float16 (Half-Precision) Operations

IEEE 754 half-precision floating-point operations, optimized for ML inference, audio DSP, and memory-bandwidth-bound workloads.

import "github.com/tphakala/simd/f16"

// Convert between float32 and float16
h := f16.FromFloat32(3.14)
f := f16.ToFloat32(h)

// Vector operations (same API as f32/f64)
a := make([]f16.Float16, 1024)
b := make([]f16.Float16, 1024)
dst := make([]f16.Float16, 1024)

f16.Add(dst, a, b)           // Element-wise addition
dot := f16.DotProduct(a, b)  // Dot product (returns float32)
f16.ReLU(dst, a)             // Activation functions

| Category    | Function                          | Description                   | SIMD Width |
|-------------|-----------------------------------|-------------------------------|------------|
| Conversion  | ToFloat32(h)                      | FP16 → float32                | Scalar |
|             | FromFloat32(f)                    | float32 → FP16                | Scalar |
|             | ToFloat32Slice(dst, src)          | Batch FP16 → float32          | 8x (NEON+FP16) |
|             | FromFloat32Slice(dst, src)        | Batch float32 → FP16          | 8x (NEON+FP16) |
| Arithmetic  | Add(dst, a, b)                    | Element-wise addition         | 8x (NEON+FP16) |
|             | Sub(dst, a, b)                    | Element-wise subtraction      | 8x (NEON+FP16) |
|             | Mul(dst, a, b)                    | Element-wise multiplication   | 8x (NEON+FP16) |
|             | Div(dst, a, b)                    | Element-wise division         | 8x (NEON+FP16) |
|             | Scale(dst, a, s)                  | Multiply by scalar            | 8x (NEON+FP16) |
|             | AddScalar(dst, a, s)              | Add scalar                    | 8x (NEON+FP16) |
|             | FMA(dst, a, b, c)                 | Fused multiply-add: a*b+c     | 8x (NEON+FP16) |
|             | AddScaled(dst, alpha, s)          | dst += alpha*s (AXPY)         | 8x (NEON+FP16) |
| Unary       | Abs(dst, a)                       | Absolute value                | 8x (NEON+FP16) |
|             | Neg(dst, a)                       | Negation                      | 8x (NEON+FP16) |
|             | Sqrt(dst, a)                      | Square root                   | 8x (NEON+FP16) |
|             | Reciprocal(dst, a)                | Reciprocal (1/x)              | 8x (NEON+FP16) |
| Reduction   | DotProduct(a, b) → float32        | Dot product                   | 8x (NEON+FP16) |
|             | Sum(a) → float32                  | Sum of elements               | 8x (NEON+FP16) |
|             | Min(a)                            | Minimum value                 | 8x (NEON+FP16) |
|             | Max(a)                            | Maximum value                 | 8x (NEON+FP16) |
|             | MinIdx(a)                         | Index of minimum              | Pure Go |
|             | MaxIdx(a)                         | Index of maximum              | Pure Go |
| Statistical | Mean(a) → float32                 | Arithmetic mean               | 8x (NEON+FP16) |
|             | Variance(a) → float32             | Population variance           | Pure Go |
|             | StdDev(a) → float32               | Standard deviation            | Pure Go |
| Vector      | EuclideanDistance(a, b) → float32 | L2 distance                   | Pure Go |
|             | Normalize(dst, a)                 | Unit vector normalization     | 8x (NEON+FP16) |
|             | CumulativeSum(dst, a)             | Running sum                   | Sequential |
| Range       | Clamp(dst, a, min, max)           | Clamp to range                | 8x (NEON+FP16) |
|             | ClampScale(dst, src, min, max, s) | Fused clamp and scale         | Pure Go |
| Activation  | ReLU(dst, src)                    | Rectified Linear Unit         | 8x (NEON+FP16) |
|             | Sigmoid(dst, src)                 | Sigmoid: 1/(1+e^-x)           | Pure Go |
|             | Tanh(dst, src)                    | Hyperbolic tangent            | Pure Go |
|             | Exp(dst, src)                     | Exponential e^x               | Pure Go |
| Batch       | DotProductBatch(r, rows, v)       | Multiple dot products         | 8x (NEON+FP16) |
| Signal      | ConvolveValid(dst, sig, k)        | FIR filter / convolution      | Pure Go |
|             | AccumulateAdd(dst, src, off)      | Overlap-add: dst[off:] += src | 8x (NEON+FP16) |
| Audio       | Interleave2(dst, a, b)            | Pack stereo: [L,R,L,R,...]    | Pure Go |
|             | Deinterleave2(a, b, src)          | Unpack stereo to channels     | Pure Go |
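
A typical bandwidth-saving pattern is to store data as FP16 and convert at the edges with the batch converters above; a sketch with illustrative values:

import "github.com/tphakala/simd/f16"

// float32 source data, e.g. model weights or audio samples
src := []float32{0.5, 1.5, 2.5, 3.5}

// Pack to half precision: half the memory traffic of float32
weights := make([]f16.Float16, len(src))
f16.FromFloat32Slice(weights, src)

// Reductions accumulate in float32 (see below)
dot := f16.DotProduct(weights, weights) // 21.0

// Unpack back to float32 when full precision is needed downstream
back := make([]float32, len(weights))
f16.ToFloat32Slice(back, weights)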

Key characteristics:

  • Storage: IEEE 754 half-precision (1 sign, 5 exponent, 10 mantissa bits)
  • Precision: ~3.3 decimal digits, range ~6×10⁻⁸ to 65504
  • Reductions: Accumulate in float32 for numerical stability
  • Memory efficiency: 2x bandwidth vs float32 (8 elements per 128-bit NEON vector)

Hardware requirements:

  • Native FP16 SIMD: ARM64 with FEAT_FP16 (ARMv8.2-A+)
    • Apple Silicon (M1/M2/M3/M4) ✅
    • Cortex-A55, A75, A76, A77, A78, X1, X2, X3 ✅
    • Raspberry Pi 5 (Cortex-A76) ✅
  • Pure Go fallback: All other platforms
    • Raspberry Pi 3/4 (Cortex-A53/A72 - ARMv8.0) - works but no SIMD acceleration
    • AMD64 - works but no SIMD acceleration

c128 - complex128 Operations

SIMD-accelerated complex number operations for FFT-based signal processing:

| Category   | Function          | Description                        | SIMD Width |
|------------|-------------------|------------------------------------|------------|
| Arithmetic | Mul(dst, a, b)    | Complex multiplication             | 4x (AVX-512) / 2x (AVX) |
|            | MulConj(dst, a, b) | Multiply by conjugate: a × conj(b) | 4x / 2x |
|            | Scale(dst, a, s)  | Scale by complex scalar            | 4x / 2x |
|            | Add(dst, a, b)    | Complex addition                   | 4x / 2x |
|            | Sub(dst, a, b)    | Complex subtraction                | 4x / 2x |
| Unary      | Abs(dst, a)       | Complex magnitude \|a + bi\|       | 4x (AVX-512) / 2x (AVX) |
|            | AbsSq(dst, a)     | Magnitude squared \|a + bi\|²      | 4x / 2x |
|            | Conj(dst, a)      | Complex conjugate: a - bi          | 4x / 2x |

These operations are designed for FFT-based signal processing pipelines:

import "github.com/tphakala/simd/c128"

// Frequency-domain multiplication (FFT convolution)
signalFFT := make([]complex128, n)
kernelFFT := make([]complex128, n)
result := make([]complex128, n)
magnitude := make([]float64, n)

// Frequency-domain filtering
c128.Mul(result, signalFFT, kernelFFT)          // Complex multiply
c128.MulConj(result, signalFFT, kernelFFT)      // Cross-correlation

// Spectrogram and magnitude analysis
c128.Abs(magnitude, signalFFT)                  // Extract magnitude for display

Use Cases:

  • Abs/AbsSq: Spectrograms, power spectral density, frequency analysis
  • Conj: Cross-correlation, frequency-domain filtering
  • Mul/MulConj: FFT-based convolution, filtering, correlation
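
For example, a power-spectral-density pass over one FFT frame reduces to a single AbsSq call plus a dB conversion; a sketch (the FFT that produces the spectrum is outside this library):

import (
    "math"

    "github.com/tphakala/simd/c128"
)

// One FFT frame (stand-in values; normally produced by an FFT routine)
spectrum := []complex128{1 + 1i, 3 + 4i, 0.5 + 0i, 2 - 2i}

// Power per bin: |z|², vectorized and cheaper than Abs (no sqrt)
power := make([]float64, len(spectrum))
c128.AbsSq(power, spectrum)

// Convert to dB for display
for i := range power {
    power[i] = 10 * math.Log10(power[i])
}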

Benchmark (1024 elements, Intel i7-1260P AVX+FMA):

| Operation | SIMD    | Pure Go | Speedup |
|-----------|---------|---------|---------|
| Mul       | 341 ns  | 757 ns  | 2.2x |
| MulConj   | 340 ns  | 749 ns  | 2.2x |
| Scale     | 253 ns  | 551 ns  | 2.2x |
| Add       | 86 ns   | 189 ns  | 2.2x |
| Abs       | 1326 ns | 2260 ns | 1.7x |
| AbsSq     | 367 ns  | 504 ns  | 1.37x |
| Conj      | 304 ns  | 474 ns  | 1.56x |

c64 - complex64 Operations

SIMD-accelerated single-precision complex number operations:

| Category   | Function           | Description                        | SIMD Width |
|------------|--------------------|------------------------------------|------------|
| Arithmetic | Mul(dst, a, b)     | Complex multiplication             | 8x (AVX-512) / 4x (AVX) / 2x (NEON) |
|            | MulConj(dst, a, b) | Multiply by conjugate: a × conj(b) | 8x / 4x / 2x |
|            | Scale(dst, a, s)   | Scale by complex scalar            | 8x / 4x / 2x |
|            | Add(dst, a, b)     | Complex addition                   | 8x / 4x / 2x |
|            | Sub(dst, a, b)     | Complex subtraction                | 8x / 4x / 2x |
| Unary      | Abs(dst, a)        | Complex magnitude \|a + bi\|       | 8x / 4x / 2x |
|            | AbsSq(dst, a)      | Magnitude squared \|a + bi\|²      | 8x / 4x / 2x |
|            | Conj(dst, a)       | Complex conjugate: a - bi          | 8x / 4x / 2x |
| Conversion | FromReal(dst, src) | Real to complex: src → src+0i      | 8x / 4x / 2x |

Same API as c128, but complex64 elements are 8 bytes instead of 16, so each SIMD vector holds twice as many:

import "github.com/tphakala/simd/c64"

// Single-precision FFT processing
signalFFT := make([]complex64, n)
kernelFFT := make([]complex64, n)
result := make([]complex64, n)
magnitude := make([]float32, n)

c64.Mul(result, signalFFT, kernelFFT)     // Complex multiply
c64.Abs(magnitude, signalFFT)              // Extract magnitude
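
FromReal, which has no c128 counterpart, widens a real signal into complex form ahead of a complex FFT; a minimal sketch:

// Zero-imaginary expansion of a real frame before a complex FFT
samples := []float32{0.1, 0.2, 0.3, 0.4}
frame := make([]complex64, len(samples))
c64.FromReal(frame, samples) // frame[i] = complex(samples[i], 0)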

Performance

AMD64 (Intel Core i7-1260P, AVX+FMA)

float64 Operations - SIMD vs Pure Go (1024 elements)

| Category    | Operation         | SIMD (ns) | Go (ns) | Speedup |
|-------------|-------------------|-----------|---------|---------|
| Arithmetic  | Add               | 84        | 446     | 5.3x |
|             | Sub               | 84        | 335     | 4.0x |
|             | Mul               | 86        | 436     | 5.1x |
|             | Div               | 441       | 941     | 2.1x |
|             | Scale             | 68        | 272     | 4.0x |
|             | AddScalar         | 68        | 286     | 4.2x |
|             | FMA               | 110       | 557     | 5.0x |
| Unary       | Abs               | 66        | 365     | 5.6x |
|             | Neg               | 66        | 306     | 4.6x |
|             | Sqrt              | 658       | 1323    | 2.0x |
|             | Reciprocal        | 447       | 920     | 2.1x |
| Reduction   | DotProduct        | 162       | 859     | 5.3x |
|             | Sum               | 82        | 184     | 2.3x |
|             | Min               | 157       | 340     | 2.2x |
|             | Max               | 154       | 352     | 2.3x |
| Statistical | Mean              | 82        | 184     | 2.3x |
|             | Variance*         | 820       | 902     | 1.1x |
|             | StdDev*           | 825       | 905     | 1.1x |
| Vector      | EuclideanDistance | 216       | 1071    | 5.0x |
|             | Normalize         | 220       | 1080    | 4.9x |
|             | CumulativeSum     | 428       | 425     | 1.0x |
| Range       | Clamp             | 81        | 640     | 7.9x |

*Variance/StdDev benchmarked at 4096 elements (SIMD benefits at larger sizes)

float32 Operations - SIMD vs Pure Go (1024 elements)

| Category   | Operation  | SIMD (ns) | Go (ns) | Speedup |
|------------|------------|-----------|---------|---------|
| Arithmetic | Add        | 47        | 441     | 9.4x |
|            | Sub        | 49        | 339     | 6.9x |
|            | Mul        | 49        | 436     | 8.9x |
|            | Div        | 138       | 655     | 4.8x |
|            | Scale      | 40        | 299     | 7.4x |
|            | AddScalar  | 39        | 272     | 7.0x |
|            | FMA        | 64        | 444     | 6.9x |
| Unary      | Abs        | 37        | 656     | 17.6x |
|            | Neg        | 40        | 273     | 6.9x |
| Reduction  | DotProduct | 71        | 424     | 5.9x |
|            | Sum        | 41        | 123     | 3.0x |
|            | Min        | 65        | 340     | 5.2x |
|            | Max        | 66        | 352     | 5.3x |
| Range      | Clamp      | 47        | 701     | 14.8x |

Activation Functions - SIMD vs Pure Go

float32 (1024 elements):

| Function | SIMD (ns) | Go (ns) | Speedup | SIMD Throughput |
|----------|-----------|---------|---------|-----------------|
| Sigmoid  | 138       | 5906    | 43x     | 59.3 GB/s |
| ReLU     | 39        | 662     | 17x     | 211 GB/s |
| Tanh     | 138       | 28116   | 204x    | 59.5 GB/s |

float64 (1024 elements):

| Function | SIMD (ns) | Go (ns) | Speedup | SIMD Throughput |
|----------|-----------|---------|---------|-----------------|
| ReLU     | 68        | 646     | 9.5x    | 240 GB/s |
| Tanh     | 445       | 6230    | 14x     | 36.8 GB/s |

Key Characteristics:

  • Tanh: 200x+ speedup for f32 - fast approximation with saturation vs math.Tanh
  • ReLU: Highest throughput (211-240 GB/s) - simple max(0, x) operation
  • Sigmoid: 43x speedup for f32 - fast approximation with exponential

Batch & Signal Processing (varied sizes)

| Operation                | Config                | SIMD    | Go      | Speedup |
|--------------------------|-----------------------|---------|---------|---------|
| DotProductBatch (f64)    | 256 vec × 100 rows    | 3.2 µs  | 20.5 µs | 6.4x |
| DotProductBatch (f32)    | 256 vec × 100 rows    | 1.5 µs  | 9.8 µs  | 6.7x |
| ConvolveValid (f64)      | 4096 sig × 64 ker     | 26.6 µs | 169 µs  | 6.3x |
| ConvolveValid (f32)      | 4096 sig × 64 ker     | 17.9 µs | 80 µs   | 4.5x |
| ConvolveValidMulti (f64) | 1000 sig × 64 ker × 2 | 13.4 µs | -       | - |
| CubicInterpDot (f64)     | 241 taps              | 47 ns   | 88 ns   | 1.9x |
| CubicInterpDot (f32)     | 241 taps              | 21 ns   | 66 ns   | 3.1x |
| Int32ToFloat32Scale      | 1024 elements         | 40 ns   | 364 ns  | 9.0x |
| Int32ToFloat32Scale      | 4096 elements         | 153 ns  | 1439 ns | 9.4x |
| Interleave2 (f64)        | 1000 pairs            | 216 ns  | -       | - |
| Deinterleave2 (f64)      | 1000 pairs            | 216 ns  | -       | - |
| Interleave2 (f32)        | 1000 pairs            | 109 ns  | -       | - |
| Deinterleave2 (f32)      | 1000 pairs            | 216 ns  | -       | - |

Performance Summary

| Package | Average Speedup | Best         | Operations   |
|---------|-----------------|--------------|--------------|
| f32     | 6.5x            | 21.8x (Abs)  | 35 functions |
| f64     | 3.2x            | 7.9x (Clamp) | 32 functions |
| c128    | 1.77x           | 2.2x (Mul)   | 8 functions |
| c64     | ~2x             | ~3x (Mul)    | 9 functions |

ARM64 (Raspberry Pi 5, NEON)

float64 Operations

| Operation  | Size | Time   | Throughput |
|------------|------|--------|------------|
| DotProduct | 277  | 151 ns | 29 GB/s |
| DotProduct | 1000 | 513 ns | 31 GB/s |
| Add        | 1000 | 775 ns | 31 GB/s |
| Mul        | 1000 | 727 ns | 33 GB/s |
| FMA        | 1000 | 890 ns | 36 GB/s |
| Sum        | 1000 | 635 ns | 13 GB/s |
| Mean       | 1000 | 677 ns | 12 GB/s |

float32 Operations

| Operation  | Size  | Time    | Throughput |
|------------|-------|---------|------------|
| DotProduct | 100   | 37 ns   | 21 GB/s |
| DotProduct | 1000  | 263 ns  | 30 GB/s |
| DotProduct | 10000 | 2.78 µs | 29 GB/s |
| Add        | 1000  | 389 ns  | 31 GB/s |
| Mul        | 1000  | 390 ns  | 31 GB/s |
| FMA        | 1000  | 479 ns  | 33 GB/s |

Comparison vs Pure Go

| Operation        | Size | SIMD   | Pure Go | Speedup |
|------------------|------|--------|---------|---------|
| DotProduct (f32) | 100  | 37 ns  | 137 ns  | 3.7x |
| DotProduct (f32) | 1000 | 262 ns | 1350 ns | 5.2x |
| DotProduct (f64) | 100  | 62 ns  | 138 ns  | 2.2x |
| DotProduct (f64) | 1000 | 513 ns | 1353 ns | 2.6x |
| Add (f32)        | 1000 | 389 ns | 2015 ns | 5.2x |
| Sum (f32)        | 1000 | 343 ns | 1327 ns | 3.9x |

Performance Notes

  • AMD64: Explicit SIMD provides 5x speedups for most operations compared to pure Go, with consistent high throughput across all vector sizes.

  • ARM64: NEON SIMD provides substantial speedups over pure Go across all operations:

    • float32: 3.7x - 5.2x faster (4 elements per 128-bit vector)
    • float64: 2.2x - 2.6x faster (2 elements per 128-bit vector)
  • CumulativeSum is inherently sequential (each element depends on the previous) and uses pure Go on all platforms.

Known Limitations

Small Slice Fallback for Min/Max (AMD64)

On AMD64, the Min and Max functions fall back to pure Go for small slices:

  • float64: slices with fewer than 4 elements
  • float32: slices with fewer than 8 elements

This is because AVX assembly loads multiple elements at once (4 float64s or 8 float32s), which would cause out-of-bounds memory access on smaller slices.

The Go fallback for small slices is intentional and likely optimal - SIMD setup overhead (register loading, masking, horizontal reduction) would exceed the cost of a simple 2-3 element comparison loop.
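
Conceptually, the guard looks like the following sketch (illustrative, with hypothetical helper names - not the library's actual source):

// Illustrative only: guard a 4-wide float64 kernel against short slices
func Min(a []float64) float64 {
    if len(a) < 4 || !cpu.HasAVX() {
        return minGo(a) // hypothetical scalar fallback loop
    }
    return minAVX(a) // hypothetical assembly kernel; loads 4 float64s per iteration
}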

Architecture Support

| Architecture | Instruction Set | f64/f32/c128/c64  | f16 |
|--------------|-----------------|-------------------|-----|
| AMD64        | AVX-512         | Full SIMD support | Pure Go fallback |
| AMD64        | AVX + FMA       | Full SIMD support | Pure Go fallback |
| AMD64        | SSE4.1          | Full SIMD support | Pure Go fallback |
| ARM64        | NEON + FP16     | Full SIMD support | Full SIMD support |
| ARM64        | NEON only       | Full SIMD support | Pure Go fallback |
| Other        | -               | Pure Go fallback  | Pure Go fallback |

ARM64 FP16 support by device:

| Device / SoC          | Core(s)        | Architecture | FP16 SIMD |
|-----------------------|----------------|--------------|-----------|
| Apple Silicon (M1-M4) | Firestorm+     | ARMv8.4-A    | ✅ Yes |
| Raspberry Pi 5        | Cortex-A76     | ARMv8.2-A    | ✅ Yes |
| Raspberry Pi 4        | Cortex-A72     | ARMv8.0-A    | ❌ No |
| Raspberry Pi 3        | Cortex-A53     | ARMv8.0-A    | ❌ No |
| AWS Graviton 2/3      | Neoverse N1/V1 | ARMv8.2-A+   | ✅ Yes |
| Ampere Altra          | Neoverse N1    | ARMv8.2-A    | ✅ Yes |

Design Principles

  1. Pure Go assembly - Native Go assembler for maximum portability and easy cross-compilation
  2. Runtime dispatch - CPU features detected once at init time, zero runtime overhead
  3. Zero allocations - No heap allocations in hot paths
  4. Safe defaults - Gracefully falls back to pure Go on unsupported CPUs
  5. Boundary safe - Handles any slice length, not just SIMD-aligned sizes
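
Principle 2 can be sketched as a function pointer chosen once in init, so the hot path pays only an indirect call (illustrative names, not the library's internals):

// Illustrative sketch of init-time dispatch: the feature check runs once,
// so there is no per-call branching afterwards
var addImpl func(dst, a, b []float64)

func init() {
    switch {
    case cpu.HasAVX512():
        addImpl = addAVX512 // hypothetical assembly stub
    case cpu.HasAVX():
        addImpl = addAVX // hypothetical assembly stub
    default:
        addImpl = addGo // pure Go fallback
    }
}

func Add(dst, a, b []float64) { addImpl(dst, a, b) }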

Testing

The library includes comprehensive tests with pure Go reference implementations for validation:

# Run all tests
go test ./...

# Run tests with verbose output
task test

# Run benchmarks
task bench

# Compare SIMD vs pure Go performance
task bench:compare

# Show CPU SIMD capabilities
task cpu

See Taskfile.yml for all available tasks.
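
A validation test in that style compares each SIMD path against a scalar reference; a hedged sketch, where sumGo stands in for a hypothetical pure Go reference implementation:

func TestSumMatchesReference(t *testing.T) {
    // Odd length exercises the scalar tail after the SIMD main loop
    a := make([]float64, 1021)
    for i := range a {
        a[i] = float64(i) * 0.5
    }
    got := f64.Sum(a)
    want := sumGo(a) // hypothetical pure Go reference
    if math.Abs(got-want) > 1e-9 {
        t.Fatalf("Sum = %v, want %v", got, want)
    }
}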

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.
