simd


A high-performance SIMD (Single Instruction, Multiple Data) library for Go providing vectorized operations on float64, float32, float16, complex128, and complex64 slices.

Features

  • Pure Go assembly - Native Go assembler, simple cross-compilation
  • Runtime CPU detection - Automatically selects optimal implementation (AVX-512, AVX+FMA, SSE4.1, NEON, NEON+FP16, or pure Go)
  • Zero allocations - All operations work on pre-allocated slices
  • 80+ operations - Arithmetic, reduction, statistical, vector, signal-processing, activation, and complex-number operations
  • Multi-architecture - AMD64 (AVX-512/AVX+FMA/SSE4.1) and ARM64 (NEON/NEON+FP16) with pure Go fallback
  • Half-precision support - Native FP16 SIMD on ARM64 with FP16 extension (Apple Silicon, Cortex-A55+)
  • Thread-safe - All functions are safe for concurrent use

Installation

go get github.com/tphakala/simd

Requires Go 1.25+

Quick Start

package main

import (
    "fmt"
    "github.com/tphakala/simd/cpu"
    "github.com/tphakala/simd/f64"
)

func main() {
    fmt.Println("SIMD:", cpu.Info())

    // Input vectors
    a := []float64{1, 2, 3, 4, 5, 6, 7, 8}
    b := []float64{8, 7, 6, 5, 4, 3, 2, 1}

    // Dot product
    dot := f64.DotProduct(a, b)
    fmt.Println("Dot product:", dot) // 120

    // Element-wise operations
    dst := make([]float64, len(a))
    f64.Add(dst, a, b)
    fmt.Println("Sum:", dst) // [9, 9, 9, 9, 9, 9, 9, 9]

    // Statistical operations
    mean := f64.Mean(a)
    stddev := f64.StdDev(a)
    fmt.Printf("Mean: %.2f, StdDev: %.2f\n", mean, stddev)

    // Vector operations
    f64.Normalize(dst, a)
    fmt.Println("Normalized:", dst)

    // Distance calculation
    dist := f64.EuclideanDistance(a, b)
    fmt.Println("Distance:", dist)
}
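
All functions are safe for concurrent use and allocate nothing, so large workloads can be split across goroutines as long as each one writes a disjoint region of the destination; a minimal sketch:

package main

import (
    "fmt"
    "sync"

    "github.com/tphakala/simd/f64"
)

func main() {
    const n, workers = 1 << 20, 4
    a, b, dst := make([]float64, n), make([]float64, n), make([]float64, n)

    var wg sync.WaitGroup
    chunk := n / workers
    for w := 0; w < workers; w++ {
        lo, hi := w*chunk, (w+1)*chunk
        if w == workers-1 {
            hi = n // last worker takes any remainder
        }
        wg.Add(1)
        go func(lo, hi int) {
            defer wg.Done()
            // Each goroutine writes a disjoint region of dst, so no locking is needed.
            f64.Add(dst[lo:hi], a[lo:hi], b[lo:hi])
        }(lo, hi)
    }
    wg.Wait()
    fmt.Println(dst[0])
}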

Packages

cpu - CPU Feature Detection

import "github.com/tphakala/simd/cpu"

fmt.Println(cpu.Info())      // "AMD64 AVX-512", "AMD64 AVX+FMA", "AMD64 SSE2", or "ARM64 NEON"
fmt.Println(cpu.HasAVX())    // true/false
fmt.Println(cpu.HasAVX512()) // true/false
fmt.Println(cpu.HasNEON())   // true/false
fmt.Println(cpu.HasFP16())   // true/false (ARM64 half-precision SIMD)

f64 - float64 Operations

| Category    | Function                          | Description                    | SIMD Width |
|-------------|-----------------------------------|--------------------------------|------------|
| Arithmetic  | Add(dst, a, b)                    | Element-wise addition          | 8x (AVX-512) / 4x (AVX) / 2x (NEON) |
|             | Sub(dst, a, b)                    | Element-wise subtraction       | 8x / 4x / 2x |
|             | Mul(dst, a, b)                    | Element-wise multiplication    | 8x / 4x / 2x |
|             | Div(dst, a, b)                    | Element-wise division          | 8x / 4x / 2x |
|             | Scale(dst, a, s)                  | Multiply by scalar             | 8x / 4x / 2x |
|             | AddScalar(dst, a, s)              | Add scalar                     | 8x / 4x / 2x |
|             | FMA(dst, a, b, c)                 | Fused multiply-add: a*b+c      | 8x / 4x / 2x |
|             | AddScaled(dst, alpha, s)          | dst += alpha*s (axpy)          | 8x / 4x / 2x |
| Unary       | Abs(dst, a)                       | Absolute value                 | 8x / 4x / 2x |
|             | Neg(dst, a)                       | Negation                       | 8x / 4x / 2x |
|             | Sqrt(dst, a)                      | Square root                    | 8x / 4x / 2x |
|             | Reciprocal(dst, a)                | Reciprocal (1/x)               | 8x / 4x / 2x |
| Reduction   | DotProduct(a, b)                  | Dot product                    | 8x / 4x / 2x |
|             | Sum(a)                            | Sum of elements                | 8x / 4x / 2x |
|             | Min(a)                            | Minimum value                  | 8x / 4x / 2x |
|             | Max(a)                            | Maximum value                  | 8x / 4x / 2x |
|             | MinIdx(a)                         | Index of minimum value         | Pure Go |
|             | MaxIdx(a)                         | Index of maximum value         | Pure Go |
| Statistical | Mean(a)                           | Arithmetic mean                | 8x / 4x / 2x |
|             | Variance(a)                       | Population variance            | 8x / 4x / 2x |
|             | StdDev(a)                         | Standard deviation             | 8x / 4x / 2x |
| Vector      | EuclideanDistance(a, b)           | L2 distance                    | 8x / 4x / 2x |
|             | Normalize(dst, a)                 | Unit vector normalization      | 8x / 4x / 2x |
|             | CumulativeSum(dst, a)             | Running sum                    | Sequential |
| Range       | Clamp(dst, a, min, max)           | Clamp to range                 | 8x / 4x / 2x |
| Activation  | Sigmoid(dst, src)                 | Sigmoid: 1/(1+e^-x)            | 4x (AVX) / 2x (NEON) |
|             | ReLU(dst, src)                    | Rectified Linear Unit          | 8x / 4x / 2x |
|             | Tanh(dst, src)                    | Hyperbolic tangent             | 8x / 4x / 2x |
|             | Exp(dst, src)                     | Exponential e^x                | Pure Go |
|             | ClampScale(dst, src, min, max, s) | Fused clamp and scale          | 8x / 4x / 2x |
| Batch       | DotProductBatch(r, rows, v)       | Multiple dot products          | 8x / 4x / 2x |
| Signal      | ConvolveValid(dst, sig, k)        | FIR filter / convolution       | 8x / 4x / 2x |
|             | ConvolveValidMulti(dsts, sig, ks) | Multi-kernel convolution       | 8x / 4x / 2x |
|             | AccumulateAdd(dst, src, off)      | Overlap-add: dst[off:] += src  | 8x / 4x / 2x |
| Audio       | Interleave2(dst, a, b)            | Pack stereo: [L,R,L,R,...]     | 4x / 2x |
|             | Deinterleave2(a, b, src)          | Unpack stereo to channels      | 4x / 2x |
|             | CubicInterpDot(hist,a,b,c,d,x)    | Fused cubic interp dot product | 4x / 2x |
|             | Int32ToFloat32Scale(dst,src,s)    | PCM int32 to normalized float  | 8x / 4x |
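
The Signal functions compose into a simple FIR pipeline. A sketch using the signatures above, assuming the usual "valid" output length of len(sig)-len(k)+1; the kernel values are illustrative:

package main

import (
    "fmt"

    "github.com/tphakala/simd/f64"
)

func main() {
    // Constant input signal, so the moving average is exactly 1.0
    sig := make([]float64, 4096)
    for i := range sig {
        sig[i] = 1
    }

    // 4-tap moving-average kernel (illustrative values)
    k := []float64{0.25, 0.25, 0.25, 0.25}

    // "Valid" convolution: output covers positions where the kernel fully overlaps
    out := make([]float64, len(sig)-len(k)+1)
    f64.ConvolveValid(out, sig, k)
    fmt.Println(out[0]) // 1.0

    // Overlap-add a processed block into a longer accumulator at an offset
    acc := make([]float64, 8192)
    f64.AccumulateAdd(acc, out, 512) // acc[512:512+len(out)] += out
}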

f32 - float32 Operations

Same API as f64 but for float32 with wider SIMD:

| Architecture   | SIMD Width  |
|----------------|-------------|
| AMD64 (AVX-512) | 16x float32 |
| AMD64 (AVX+FMA) | 8x float32  |
| AMD64 (SSE2)    | 4x float32  |
| ARM64 (NEON)    | 4x float32  |

Additional split-format complex operations (for FFT pipelines with separate real/imag arrays):

| Category | Function                                     | Description                  | SIMD Width |
|----------|----------------------------------------------|------------------------------|------------|
| Complex  | MulComplex(dstRe,dstIm,aRe,aIm,bRe,bIm)      | Split-format complex multiply | 8x (AVX) / 4x (NEON) |
|          | MulConjComplex(dstRe,dstIm,aRe,aIm,bRe,bIm)  | Multiply by conjugate        | 8x / 4x |
|          | AbsSqComplex(dst,aRe,aIm)                    | Magnitude squared            | 8x / 4x |
|          | ButterflyComplex(uRe,uIm,lRe,lIm,twRe,twIm)  | FFT butterfly with twiddle   | 8x / 4x |
|          | RealFFTUnpack(outRe,outIm,zRe,zIm,twRe,twIm) | Real FFT unpack step         | 8x / 4x |
| Utility  | Reverse(dst, src)                            | Reverse slice order          | 8x / 4x |
|          | AddSub(sum, diff, a, b)                      | Fused sum and difference     | 8x / 4x |
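
The split format keeps real and imaginary parts in separate slices, avoiding the shuffles that interleaved complex64 data needs. A minimal sketch of split-format spectral math, assuming spectra already produced in split form:

package main

import (
    "fmt"

    "github.com/tphakala/simd/f32"
)

func main() {
    n := 8
    // Split-format spectra: real and imaginary parts in separate slices
    aRe, aIm := make([]float32, n), make([]float32, n)
    bRe, bIm := make([]float32, n), make([]float32, n)
    dstRe, dstIm := make([]float32, n), make([]float32, n)
    aRe[0], aIm[0], bRe[0], bIm[0] = 1, 2, 3, 4

    // Frequency-domain multiply: (1+2i)(3+4i) = -5+10i
    f32.MulComplex(dstRe, dstIm, aRe, aIm, bRe, bIm)
    fmt.Println(dstRe[0], dstIm[0]) // -5 10

    // Power spectrum without the sqrt: |a|^2 = re^2 + im^2
    power := make([]float32, n)
    f32.AbsSqComplex(power, aRe, aIm)
    fmt.Println(power[0]) // 5
}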

f16 - float16 (Half-Precision) Operations

IEEE 754 half-precision floating-point operations, optimized for ML inference, audio DSP, and memory-bandwidth-bound workloads.

import "github.com/tphakala/simd/f16"

// Convert between float32 and float16
h := f16.FromFloat32(3.14)
f := f16.ToFloat32(h)

// Vector operations (same API as f32/f64)
a := make([]f16.Float16, 1024)
b := make([]f16.Float16, 1024)
dst := make([]f16.Float16, 1024)

f16.Add(dst, a, b)           // Element-wise addition
dot := f16.DotProduct(a, b)  // Dot product (returns float32)
f16.ReLU(dst, a)             // Activation functions

| Category    | Function                          | Description                   | SIMD Width |
|-------------|-----------------------------------|-------------------------------|------------|
| Conversion  | ToFloat32(h)                      | FP16 → float32                | Scalar |
|             | FromFloat32(f)                    | float32 → FP16                | Scalar |
|             | ToFloat32Slice(dst, src)          | Batch FP16 → float32          | 8x (NEON+FP16) |
|             | FromFloat32Slice(dst, src)        | Batch float32 → FP16          | 8x (NEON+FP16) |
| Arithmetic  | Add(dst, a, b)                    | Element-wise addition         | 8x (NEON+FP16) |
|             | Sub(dst, a, b)                    | Element-wise subtraction      | 8x (NEON+FP16) |
|             | Mul(dst, a, b)                    | Element-wise multiplication   | 8x (NEON+FP16) |
|             | Div(dst, a, b)                    | Element-wise division         | 8x (NEON+FP16) |
|             | Scale(dst, a, s)                  | Multiply by scalar            | 8x (NEON+FP16) |
|             | AddScalar(dst, a, s)              | Add scalar                    | 8x (NEON+FP16) |
|             | FMA(dst, a, b, c)                 | Fused multiply-add: a*b+c     | 8x (NEON+FP16) |
|             | AddScaled(dst, alpha, s)          | dst += alpha*s (AXPY)         | 8x (NEON+FP16) |
| Unary       | Abs(dst, a)                       | Absolute value                | 8x (NEON+FP16) |
|             | Neg(dst, a)                       | Negation                      | 8x (NEON+FP16) |
|             | Sqrt(dst, a)                      | Square root                   | 8x (NEON+FP16) |
|             | Reciprocal(dst, a)                | Reciprocal (1/x)              | 8x (NEON+FP16) |
| Reduction   | DotProduct(a, b) → float32        | Dot product                   | 8x (NEON+FP16) |
|             | Sum(a) → float32                  | Sum of elements               | 8x (NEON+FP16) |
|             | Min(a)                            | Minimum value                 | 8x (NEON+FP16) |
|             | Max(a)                            | Maximum value                 | 8x (NEON+FP16) |
|             | MinIdx(a)                         | Index of minimum              | Pure Go |
|             | MaxIdx(a)                         | Index of maximum              | Pure Go |
| Statistical | Mean(a) → float32                 | Arithmetic mean               | 8x (NEON+FP16) |
|             | Variance(a) → float32             | Population variance           | Pure Go |
|             | StdDev(a) → float32               | Standard deviation            | Pure Go |
| Vector      | EuclideanDistance(a, b) → float32 | L2 distance                   | Pure Go |
|             | Normalize(dst, a)                 | Unit vector normalization     | 8x (NEON+FP16) |
|             | CumulativeSum(dst, a)             | Running sum                   | Sequential |
| Range       | Clamp(dst, a, min, max)           | Clamp to range                | 8x (NEON+FP16) |
|             | ClampScale(dst, src, min, max, s) | Fused clamp and scale         | Pure Go |
| Activation  | ReLU(dst, src)                    | Rectified Linear Unit         | 8x (NEON+FP16) |
|             | Sigmoid(dst, src)                 | Sigmoid: 1/(1+e^-x)           | Pure Go |
|             | Tanh(dst, src)                    | Hyperbolic tangent            | Pure Go |
|             | Exp(dst, src)                     | Exponential e^x               | Pure Go |
| Batch       | DotProductBatch(r, rows, v)       | Multiple dot products         | 8x (NEON+FP16) |
| Signal      | ConvolveValid(dst, sig, k)        | FIR filter / convolution      | Pure Go |
|             | AccumulateAdd(dst, src, off)      | Overlap-add: dst[off:] += src | 8x (NEON+FP16) |
| Audio       | Interleave2(dst, a, b)            | Pack stereo: [L,R,L,R,...]    | Pure Go |
|             | Deinterleave2(a, b, src)          | Unpack stereo to channels     | Pure Go |
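
A typical bandwidth-saving pattern is to store data as FP16 and convert at the edges with the batch converters above; a sketch with illustrative values:

import "github.com/tphakala/simd/f16"

// float32 source data, e.g. model weights or audio samples
src := []float32{0.5, 1.5, 2.5, 3.5}

// Pack to half precision: half the memory traffic of float32
weights := make([]f16.Float16, len(src))
f16.FromFloat32Slice(weights, src)

// Reductions accumulate in float32 (see below)
dot := f16.DotProduct(weights, weights) // 21.0

// Unpack back to float32 when full precision is needed downstream
back := make([]float32, len(weights))
f16.ToFloat32Slice(back, weights)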

Key characteristics:

  • Storage: IEEE 754 half-precision (1 sign, 5 exponent, 10 mantissa bits)
  • Precision: ~3.3 decimal digits, range ~6×10⁻⁸ to 65504
  • Reductions: Accumulate in float32 for numerical stability
  • Memory efficiency: 2x bandwidth vs float32 (8 elements per 128-bit NEON vector)

Hardware requirements:

  • Native FP16 SIMD: ARM64 with FEAT_FP16 (ARMv8.2-A+)
    • Apple Silicon (M1/M2/M3/M4) ✅
    • Cortex-A55, A75, A76, A77, A78, X1, X2, X3 ✅
    • Raspberry Pi 5 (Cortex-A76) ✅
  • Pure Go fallback: All other platforms
    • Raspberry Pi 3/4 (Cortex-A53/A72 - ARMv8.0) - works but no SIMD acceleration
    • AMD64 - works but no SIMD acceleration

c128 - complex128 Operations

SIMD-accelerated complex number operations for FFT-based signal processing:

| Category   | Function          | Description                        | SIMD Width |
|------------|-------------------|------------------------------------|------------|
| Arithmetic | Mul(dst, a, b)    | Complex multiplication             | 4x (AVX-512) / 2x (AVX) |
|            | MulConj(dst, a, b) | Multiply by conjugate: a × conj(b) | 4x / 2x |
|            | Scale(dst, a, s)  | Scale by complex scalar            | 4x / 2x |
|            | Add(dst, a, b)    | Complex addition                   | 4x / 2x |
|            | Sub(dst, a, b)    | Complex subtraction                | 4x / 2x |
| Unary      | Abs(dst, a)       | Complex magnitude \|a + bi\|       | 4x (AVX-512) / 2x (AVX) |
|            | AbsSq(dst, a)     | Magnitude squared \|a + bi\|²      | 4x / 2x |
|            | Conj(dst, a)      | Complex conjugate: a - bi          | 4x / 2x |

These operations are designed for FFT-based signal processing pipelines:

import "github.com/tphakala/simd/c128"

// Frequency-domain multiplication (FFT convolution)
signalFFT := make([]complex128, n)
kernelFFT := make([]complex128, n)
result := make([]complex128, n)
magnitude := make([]float64, n)

// Frequency-domain filtering
c128.Mul(result, signalFFT, kernelFFT)          // Complex multiply
c128.MulConj(result, signalFFT, kernelFFT)      // Cross-correlation

// Spectrogram and magnitude analysis
c128.Abs(magnitude, signalFFT)                  // Extract magnitude for display

Use Cases:

  • Abs/AbsSq: Spectrograms, power spectral density, frequency analysis
  • Conj: Cross-correlation, frequency-domain filtering
  • Mul/MulConj: FFT-based convolution, filtering, correlation
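
For example, a power-spectral-density pass over one FFT frame reduces to a single AbsSq call plus a dB conversion; a sketch (the FFT that produces the spectrum is outside this library):

import (
    "math"

    "github.com/tphakala/simd/c128"
)

// One FFT frame (stand-in values; normally produced by an FFT routine)
spectrum := []complex128{1 + 1i, 3 + 4i, 0.5 + 0i, 2 - 2i}

// Power per bin: |z|², vectorized and cheaper than Abs (no sqrt)
power := make([]float64, len(spectrum))
c128.AbsSq(power, spectrum)

// Convert to dB for display
for i := range power {
    power[i] = 10 * math.Log10(power[i])
}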

Benchmark (1024 elements, Intel i7-1260P AVX+FMA):

| Operation | SIMD    | Pure Go | Speedup |
|-----------|---------|---------|---------|
| Mul       | 341 ns  | 757 ns  | 2.2x |
| MulConj   | 340 ns  | 749 ns  | 2.2x |
| Scale     | 253 ns  | 551 ns  | 2.2x |
| Add       | 86 ns   | 189 ns  | 2.2x |
| Abs       | 1326 ns | 2260 ns | 1.7x |
| AbsSq     | 367 ns  | 504 ns  | 1.37x |
| Conj      | 304 ns  | 474 ns  | 1.56x |

c64 - complex64 Operations

SIMD-accelerated single-precision complex number operations:

| Category   | Function           | Description                        | SIMD Width |
|------------|--------------------|------------------------------------|------------|
| Arithmetic | Mul(dst, a, b)     | Complex multiplication             | 8x (AVX-512) / 4x (AVX) / 2x (NEON) |
|            | MulConj(dst, a, b) | Multiply by conjugate: a × conj(b) | 8x / 4x / 2x |
|            | Scale(dst, a, s)   | Scale by complex scalar            | 8x / 4x / 2x |
|            | Add(dst, a, b)     | Complex addition                   | 8x / 4x / 2x |
|            | Sub(dst, a, b)     | Complex subtraction                | 8x / 4x / 2x |
| Unary      | Abs(dst, a)        | Complex magnitude \|a + bi\|       | 8x / 4x / 2x |
|            | AbsSq(dst, a)      | Magnitude squared \|a + bi\|²      | 8x / 4x / 2x |
|            | Conj(dst, a)       | Complex conjugate: a - bi          | 8x / 4x / 2x |
| Conversion | FromReal(dst, src) | Real to complex: src → src+0i      | 8x / 4x / 2x |

Same API as c128, but complex64 elements are 8 bytes instead of 16, so each SIMD vector holds twice as many:

import "github.com/tphakala/simd/c64"

// Single-precision FFT processing
signalFFT := make([]complex64, n)
kernelFFT := make([]complex64, n)
result := make([]complex64, n)
magnitude := make([]float32, n)

c64.Mul(result, signalFFT, kernelFFT)     // Complex multiply
c64.Abs(magnitude, signalFFT)              // Extract magnitude
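
FromReal, which has no c128 counterpart, widens a real signal into complex form ahead of a complex FFT; a minimal sketch:

// Zero-imaginary expansion of a real frame before a complex FFT
samples := []float32{0.1, 0.2, 0.3, 0.4}
frame := make([]complex64, len(samples))
c64.FromReal(frame, samples) // frame[i] = complex(samples[i], 0)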

Performance

AMD64 (Intel Core i7-1260P, AVX+FMA)

float64 Operations - SIMD vs Pure Go (1024 elements)

| Category    | Operation         | SIMD (ns) | Go (ns) | Speedup |
|-------------|-------------------|-----------|---------|---------|
| Arithmetic  | Add               | 84        | 446     | 5.3x |
|             | Sub               | 84        | 335     | 4.0x |
|             | Mul               | 86        | 436     | 5.1x |
|             | Div               | 441       | 941     | 2.1x |
|             | Scale             | 68        | 272     | 4.0x |
|             | AddScalar         | 68        | 286     | 4.2x |
|             | FMA               | 110       | 557     | 5.0x |
| Unary       | Abs               | 66        | 365     | 5.6x |
|             | Neg               | 66        | 306     | 4.6x |
|             | Sqrt              | 658       | 1323    | 2.0x |
|             | Reciprocal        | 447       | 920     | 2.1x |
| Reduction   | DotProduct        | 162       | 859     | 5.3x |
|             | Sum               | 82        | 184     | 2.3x |
|             | Min               | 157       | 340     | 2.2x |
|             | Max               | 154       | 352     | 2.3x |
| Statistical | Mean              | 82        | 184     | 2.3x |
|             | Variance*         | 820       | 902     | 1.1x |
|             | StdDev*           | 825       | 905     | 1.1x |
| Vector      | EuclideanDistance | 216       | 1071    | 5.0x |
|             | Normalize         | 220       | 1080    | 4.9x |
|             | CumulativeSum     | 428       | 425     | 1.0x |
| Range       | Clamp             | 81        | 640     | 7.9x |

*Variance/StdDev benchmarked at 4096 elements (SIMD benefits at larger sizes)

float32 Operations - SIMD vs Pure Go (1024 elements)

| Category   | Operation  | SIMD (ns) | Go (ns) | Speedup |
|------------|------------|-----------|---------|---------|
| Arithmetic | Add        | 47        | 441     | 9.4x |
|            | Sub        | 49        | 339     | 6.9x |
|            | Mul        | 49        | 436     | 8.9x |
|            | Div        | 138       | 655     | 4.8x |
|            | Scale      | 40        | 299     | 7.4x |
|            | AddScalar  | 39        | 272     | 7.0x |
|            | FMA        | 64        | 444     | 6.9x |
| Unary      | Abs        | 37        | 656     | 17.6x |
|            | Neg        | 40        | 273     | 6.9x |
| Reduction  | DotProduct | 71        | 424     | 5.9x |
|            | Sum        | 41        | 123     | 3.0x |
|            | Min        | 65        | 340     | 5.2x |
|            | Max        | 66        | 352     | 5.3x |
| Range      | Clamp      | 47        | 701     | 14.8x |

Activation Functions - SIMD vs Pure Go

float32 (1024 elements):

| Function | SIMD (ns) | Go (ns) | Speedup | SIMD Throughput |
|----------|-----------|---------|---------|-----------------|
| Sigmoid  | 138       | 5906    | 43x     | 59.3 GB/s |
| ReLU     | 39        | 662     | 17x     | 211 GB/s |
| Tanh     | 138       | 28116   | 204x    | 59.5 GB/s |

float64 (1024 elements):

| Function | SIMD (ns) | Go (ns) | Speedup | SIMD Throughput |
|----------|-----------|---------|---------|-----------------|
| ReLU     | 68        | 646     | 9.5x    | 240 GB/s |
| Tanh     | 445       | 6230    | 14x     | 36.8 GB/s |

Key Characteristics:

  • Tanh: 200x+ speedup for f32 - fast approximation with saturation vs math.Tanh
  • ReLU: Highest throughput (211-240 GB/s) - simple max(0, x) operation
  • Sigmoid: 43x speedup for f32 - fast approximation with exponential

Batch & Signal Processing (varied sizes)

| Operation                | Config                | SIMD    | Go      | Speedup |
|--------------------------|-----------------------|---------|---------|---------|
| DotProductBatch (f64)    | 256 vec × 100 rows    | 3.2 µs  | 20.5 µs | 6.4x |
| DotProductBatch (f32)    | 256 vec × 100 rows    | 1.5 µs  | 9.8 µs  | 6.7x |
| ConvolveValid (f64)      | 4096 sig × 64 ker     | 26.6 µs | 169 µs  | 6.3x |
| ConvolveValid (f32)      | 4096 sig × 64 ker     | 17.9 µs | 80 µs   | 4.5x |
| ConvolveValidMulti (f64) | 1000 sig × 64 ker × 2 | 13.4 µs | -       | - |
| CubicInterpDot (f64)     | 241 taps              | 47 ns   | 88 ns   | 1.9x |
| CubicInterpDot (f32)     | 241 taps              | 21 ns   | 66 ns   | 3.1x |
| Int32ToFloat32Scale      | 1024 elements         | 40 ns   | 364 ns  | 9.0x |
| Int32ToFloat32Scale      | 4096 elements         | 153 ns  | 1439 ns | 9.4x |
| Interleave2 (f64)        | 1000 pairs            | 216 ns  | -       | - |
| Deinterleave2 (f64)      | 1000 pairs            | 216 ns  | -       | - |
| Interleave2 (f32)        | 1000 pairs            | 109 ns  | -       | - |
| Deinterleave2 (f32)      | 1000 pairs            | 216 ns  | -       | - |

Performance Summary

| Package | Average Speedup | Best         | Operations   |
|---------|-----------------|--------------|--------------|
| f32     | 6.5x            | 21.8x (Abs)  | 35 functions |
| f64     | 3.2x            | 7.9x (Clamp) | 32 functions |
| c128    | 1.77x           | 2.2x (Mul)   | 8 functions |
| c64     | ~2x             | ~3x (Mul)    | 9 functions |

ARM64 (Raspberry Pi 5, NEON)

float64 Operations

| Operation  | Size | Time   | Throughput |
|------------|------|--------|------------|
| DotProduct | 277  | 151 ns | 29 GB/s |
| DotProduct | 1000 | 513 ns | 31 GB/s |
| Add        | 1000 | 775 ns | 31 GB/s |
| Mul        | 1000 | 727 ns | 33 GB/s |
| FMA        | 1000 | 890 ns | 36 GB/s |
| Sum        | 1000 | 635 ns | 13 GB/s |
| Mean       | 1000 | 677 ns | 12 GB/s |

float32 Operations

| Operation  | Size  | Time    | Throughput |
|------------|-------|---------|------------|
| DotProduct | 100   | 37 ns   | 21 GB/s |
| DotProduct | 1000  | 263 ns  | 30 GB/s |
| DotProduct | 10000 | 2.78 µs | 29 GB/s |
| Add        | 1000  | 389 ns  | 31 GB/s |
| Mul        | 1000  | 390 ns  | 31 GB/s |
| FMA        | 1000  | 479 ns  | 33 GB/s |

Comparison vs Pure Go

| Operation        | Size | SIMD   | Pure Go | Speedup |
|------------------|------|--------|---------|---------|
| DotProduct (f32) | 100  | 37 ns  | 137 ns  | 3.7x |
| DotProduct (f32) | 1000 | 262 ns | 1350 ns | 5.2x |
| DotProduct (f64) | 100  | 62 ns  | 138 ns  | 2.2x |
| DotProduct (f64) | 1000 | 513 ns | 1353 ns | 2.6x |
| Add (f32)        | 1000 | 389 ns | 2015 ns | 5.2x |
| Sum (f32)        | 1000 | 343 ns | 1327 ns | 3.9x |

Performance Notes

  • AMD64: Explicit SIMD provides 5x speedups for most operations compared to pure Go, with consistent high throughput across all vector sizes.

  • ARM64: NEON SIMD provides substantial speedups over pure Go across all operations:

    • float32: 3.7x - 5.2x faster (4 elements per 128-bit vector)
    • float64: 2.2x - 2.6x faster (2 elements per 128-bit vector)
  • CumulativeSum is inherently sequential (each element depends on the previous) and uses pure Go on all platforms.

Known Limitations

Small Slice Fallback for Min/Max (AMD64)

On AMD64, the Min and Max functions fall back to pure Go for small slices:

  • float64: slices with fewer than 4 elements
  • float32: slices with fewer than 8 elements

This is because AVX assembly loads multiple elements at once (4 float64s or 8 float32s), which would cause out-of-bounds memory access on smaller slices.

The Go fallback for small slices is intentional and likely optimal - SIMD setup overhead (register loading, masking, horizontal reduction) would exceed the cost of a simple 2-3 element comparison loop.
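
Conceptually, the guard looks like the following sketch (illustrative, with hypothetical helper names - not the library's actual source):

// Illustrative only: guard a 4-wide float64 kernel against short slices
func Min(a []float64) float64 {
    if len(a) < 4 || !cpu.HasAVX() {
        return minGo(a) // hypothetical scalar fallback loop
    }
    return minAVX(a) // hypothetical assembly kernel; loads 4 float64s per iteration
}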

Architecture Support

| Architecture | Instruction Set | f64/f32/c128/c64  | f16 |
|--------------|-----------------|-------------------|-----|
| AMD64        | AVX-512         | Full SIMD support | Pure Go fallback |
| AMD64        | AVX + FMA       | Full SIMD support | Pure Go fallback |
| AMD64        | SSE4.1          | Full SIMD support | Pure Go fallback |
| ARM64        | NEON + FP16     | Full SIMD support | Full SIMD support |
| ARM64        | NEON only       | Full SIMD support | Pure Go fallback |
| Other        | -               | Pure Go fallback  | Pure Go fallback |

ARM64 FP16 support by device:

| Device / SoC          | Core(s)        | Architecture | FP16 SIMD |
|-----------------------|----------------|--------------|-----------|
| Apple Silicon (M1-M4) | Firestorm+     | ARMv8.4-A    | ✅ Yes |
| Raspberry Pi 5        | Cortex-A76     | ARMv8.2-A    | ✅ Yes |
| Raspberry Pi 4        | Cortex-A72     | ARMv8.0-A    | ❌ No |
| Raspberry Pi 3        | Cortex-A53     | ARMv8.0-A    | ❌ No |
| AWS Graviton 2/3      | Neoverse N1/V1 | ARMv8.2-A+   | ✅ Yes |
| Ampere Altra          | Neoverse N1    | ARMv8.2-A    | ✅ Yes |

Design Principles

  1. Pure Go assembly - Native Go assembler for maximum portability and easy cross-compilation
  2. Runtime dispatch - CPU features detected once at init time, zero runtime overhead
  3. Zero allocations - No heap allocations in hot paths
  4. Safe defaults - Gracefully falls back to pure Go on unsupported CPUs
  5. Boundary safe - Handles any slice length, not just SIMD-aligned sizes
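
Principle 2 can be sketched as a function pointer chosen once in init, so the hot path pays only an indirect call (illustrative names, not the library's internals):

// Illustrative sketch of init-time dispatch: the feature check runs once,
// so there is no per-call branching afterwards
var addImpl func(dst, a, b []float64)

func init() {
    switch {
    case cpu.HasAVX512():
        addImpl = addAVX512 // hypothetical assembly stub
    case cpu.HasAVX():
        addImpl = addAVX // hypothetical assembly stub
    default:
        addImpl = addGo // pure Go fallback
    }
}

func Add(dst, a, b []float64) { addImpl(dst, a, b) }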

Testing

The library includes comprehensive tests with pure Go reference implementations for validation:

# Run all tests
go test ./...

# Run tests with verbose output
task test

# Run benchmarks
task bench

# Compare SIMD vs pure Go performance
task bench:compare

# Show CPU SIMD capabilities
task cpu

See Taskfile.yml for all available tasks.
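
A validation test in that style compares each SIMD path against a scalar reference; a hedged sketch, where sumGo stands in for a hypothetical pure Go reference implementation:

func TestSumMatchesReference(t *testing.T) {
    // Odd length exercises the scalar tail after the SIMD main loop
    a := make([]float64, 1021)
    for i := range a {
        a[i] = float64(i) * 0.5
    }
    got := f64.Sum(a)
    want := sumGo(a) // hypothetical pure Go reference
    if math.Abs(got-want) > 1e-9 {
        t.Fatalf("Sum = %v, want %v", got, want)
    }
}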

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.
