Releases: ashvardanian/less_slow.cpp
Float FMA vs Integer DP4A & DPX Instructions ☣️
CUDA natively supports Fused-Multiply-Accumulate operations for every float type, including `f16` and `bf16`. It also provides DP4A instructions for 8-bit integer dot-products with 32-bit accumulators and `umul24` instructions for 24-bit integer multiplication. Starting with Hopper, Dynamic Programming eXtensions (DPX) were added for combinatorial problems; they can be used to implement Algebraic Graph Theory algorithms as matrix multiplications over alternative semi-rings.
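For context, here is a minimal CUDA C++ sketch of what these instruction families look like in device code. The kernel and buffer names are made up for illustration; only the `__fmaf_rn`, `__hfma2`, `__dp4a`, and `__umul24` intrinsics are standard (DP4A needs `sm_61` or newer):

```cuda
#include <cuda_fp16.h> // `__half2` and `__hfma2`

// Hypothetical kernel: each thread exercises one element per instruction family.
__global__ void fma_dp4a_umul24_sketch(                          //
    float const *a, float const *b, float *acc_f32,              // f32 FMA
    __half2 const *ha, __half2 const *hb, __half2 *acc_f16,      // packed f16 FMA
    int const *pa, int const *pb, int *acc_i32,                  // 4x i8 dot-products
    unsigned const *ua, unsigned const *ub, unsigned *acc_u32) { // 24-bit multiplies

    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;

    // Fused multiply-add: `a * b + c` with a single rounding step.
    acc_f32[i] = __fmaf_rn(a[i], b[i], acc_f32[i]);

    // Same idea for two `f16` values packed into a `__half2`.
    acc_f16[i] = __hfma2(ha[i], hb[i], acc_f16[i]);

    // DP4A: dot-product of four packed 8-bit integers, accumulated in 32 bits.
    acc_i32[i] = __dp4a(pa[i], pb[i], acc_i32[i]);

    // 24-bit multiply: only the low 24 bits of each operand participate.
    acc_u32[i] += __umul24(ua[i], ub[i]);
}
```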
How do those instructions stack up, and how much performance can we expect from recent State-of-the-Art GPUs like the Nvidia H200?
- `f64` FMA: 4.5 T
- `i64` FMA: 3.1 T
- `f32` FMA: 22 T
- `i32` FMA: 15.5 T ...so we should always prefer 32-bit ops
- `u8u32` DP4A: 39.3 T
- `u24u32` UMUL: 13.4 T ...not really better than `i32` FMA
- `f16` FMA on Volta: 12.2 T
- `bf16` FMA on Ampere: 12.2 T
- DPX for Floyd-Warshall algorithm with `u16` and `u32` on Hopper: 11 T
- DPX for Needleman-Wunsch algorithm with `i16` and `i32` on Hopper: 11 T
- DPX for Smith-Waterman algorithm with `i32` on Hopper: 27 T (see the sketch below)
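To make the DPX entries above more concrete, here is a rough sketch of the per-cell recurrences those kernels revolve around, assuming CUDA 12+; the helper names are hypothetical, but `__viaddmin_u32` and `__vimax3_s32_relu` are real DPX intrinsics (hardware-accelerated on Hopper, emulated on older GPUs):

```cuda
// Floyd-Warshall relaxation over the (min, +) semi-ring:
// dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j]).
__device__ unsigned relax_min_plus(unsigned d_ij, unsigned d_ik, unsigned d_kj) {
    // DPX fuses the add and the min into a single instruction on Hopper.
    return __viaddmin_u32(d_ik, d_kj, d_ij);
}

// Smith-Waterman cell update over a (max, +) semi-ring clamped at zero:
// H[i][j] = max(0, H[i-1][j-1] + score, E, F).
__device__ int smith_waterman_cell(int h_diag, int score, int e, int f) {
    // The `_relu` variants clamp the result at zero, matching local alignment.
    return __vimax3_s32_relu(h_diag + score, e, f);
}
```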
Check the code and inline comments for more details!
Those goodies are now part of the "StringZilla 4 CUDA" release 🥳
Minor
- Add: `dp4a` & `umul24` instructions (ce1e3b7)
- Add: DPX instructions on Hopper (1ab4f41)
- Add: In-register FMA benchmarks for GPUs (97991fd)