Releases · ashvardanian/less_slow.cpp
v0.10.8: macOS compilation fixes 🤗 🍏
Release v0.10.7
v0.10.6: Fixing aligned allocations
Release v0.10.5
Release v0.10.4
Release v0.10.3
v0.10.2: Fast Math Patches
- Improve: Horner method (cab8824)
- Make: Default to `-O2` (56016d5)
- Fix: Compiling w/out Intel TBB (2346e03)
- Docs: Typo (#39) (99a91ba)
- Improve: Stricter range limits & `fast-math` (7ae2c01)
- Make: Formatting CMake (0e3c916)
- Improve: Detecting CUDA availability (91c5f4e)
Thanks to @corneliusroemer, @dzaima, @DrChr 🤗
Release v0.10.1
v0.10: cuBLASLt examples for `fp8_e4m3` GEMM
DeepSeek has just released their mixed-precision FP8 GEMM implementation, and it felt like a good time to introduce some cuBLASLt snippets as a baseline for such work. On Nvidia H200, the results for different input sizes look like this:
```
----------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations  UserCounters...
----------------------------------------------------------------------------------------------------
cublaslt_tops<fp8_e4m3_t, float>/256          12496 ns        12496 ns        56284  TOP=2.67999T/s
cublaslt_tops<fp8_e4m3_t, float>/512          13089 ns        13089 ns        53100  TOP=20.4883T/s
cublaslt_tops<fp8_e4m3_t, float>/1024         14882 ns        14882 ns        46918  TOP=144.23T/s
cublaslt_tops<fp8_e4m3_t, float>/2048         25802 ns        25802 ns        26869  TOP=665.679T/s
cublaslt_tops<fp8_e4m3_t, float>/4096        109316 ns       109313 ns         6021  TOP=1.25715P/s
cublaslt_tops<fp8_e4m3_t, float>/8192        821080 ns       821050 ns          629  TOP=1.33907P/s
cublaslt_tops<fp8_e4m3_t, float>/16384      7135472 ns      7135461 ns           93  TOP=1.23269P/s
cublaslt_tops<fp8_e4m3_t, float>_BigO          0.00 N^3        0.00 N^3
cublaslt_tops<fp8_e4m3_t, float>_RMS               2 %             2 %
```
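For context, the shape of such a cuBLASLt call is sketched below. This is a minimal standalone version, not the repository's exact benchmark: error checks, workspace configuration, and per-tensor scaling factors are all elided. Note that cuBLASLt's FP8 kernels require the A operand to be transposed, hence the TN layout.

```cpp
#include <cublasLt.h>
#include <cuda_fp8.h>
#include <cuda_runtime.h>

// Multiplies two column-major `fp8_e4m3` square matrices into a `float`
// matrix, accumulating in FP32. A sketch: no error handling or cleanup.
void fp8_gemm(int n) {
    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    __nv_fp8_e4m3 *a, *b;
    float *c;
    cudaMalloc(&a, n * n * sizeof(__nv_fp8_e4m3));
    cudaMalloc(&b, n * n * sizeof(__nv_fp8_e4m3));
    cudaMalloc(&c, n * n * sizeof(float));

    // FP32 accumulation with FP32 `alpha`/`beta` scaling factors.
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasOperation_t trans_a = CUBLAS_OP_T; // FP8 needs A transposed
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSA, &trans_a, sizeof(trans_a));

    // A is (k x m) pre-transposition; B and C/D are plain column-major.
    cublasLtMatrixLayout_t layout_a, layout_b, layout_c;
    cublasLtMatrixLayoutCreate(&layout_a, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&layout_b, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&layout_c, CUDA_R_32F, n, n, n);

    // Let the heuristic pick a kernel; production code should also
    // request a workspace and check that `found > 0`.
    cublasLtMatmulPreference_t preference;
    cublasLtMatmulPreferenceCreate(&preference);
    cublasLtMatmulHeuristicResult_t heuristic;
    int found = 0;
    cublasLtMatmulAlgoGetHeuristic(handle, op, layout_a, layout_b, layout_c, layout_c,
                                   preference, 1, &heuristic, &found);

    float alpha = 1.0f, beta = 0.0f;
    cublasLtMatmul(handle, op, &alpha, a, layout_a, b, layout_b, &beta,
                   c, layout_c, c, layout_c, &heuristic.algo,
                   /* workspace: */ nullptr, 0, /* stream: */ 0);
    cudaDeviceSynchronize();
}
```

The `<fp8_e4m3_t, float>` pair in the benchmark names above maps to the `CUDA_R_8F_E4M3` inputs and the `CUDA_R_32F` output here.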
The advertised FP8 throughput for the H100 and H200 in the SXM form factor is 2 Peta-Ops, and cuBLASLt peaks at about 1.34 Peta-Ops in the shared benchmarks, around 67% of that. So one should definitely be able to squeeze out more.
I haven't tried implementing synthetic ALU benchmarks for the different FP8-oriented PTX instructions, so if you have time and want to try something new, feel free to submit a PR 🤗 A rough starting point is sketched below.
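The sketch wraps the FP8 tensor-core instruction `mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32` (PTX ISA 8.0+, `sm_89` or newer) in a timing loop. The kernel name, launch shape, and iteration count are mine, chosen for illustration; only the instruction itself comes from the PTX ISA.

```cpp
#include <cuda_runtime.h>

// Synthetic loop around a warp-wide 16x8x32 e4m3 MMA.
// Compile with e.g. `nvcc -arch=sm_90a`; launch with blockDim % 32 == 0.
__global__ void fp8_e4m3_mma_kernel(float *out, int iterations) {
    // Packed e4m3 fragments; the contents are irrelevant for a timing loop.
    unsigned a0 = threadIdx.x, a1 = a0 + 1, a2 = a0 + 2, a3 = a0 + 3;
    unsigned b0 = a0 ^ 0x55555555u, b1 = b0 + 1;
    float d0 = 0.f, d1 = 0.f, d2 = 0.f, d3 = 0.f;
    // Beware: accumulating into the same registers forms a dependency
    // chain, so this measures latency; independent accumulator tiles
    // would be needed to approach peak throughput.
    for (int i = 0; i < iterations; ++i)
        asm volatile("mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 "
                     "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};"
                     : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3)
                     : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "r"(b0), "r"(b1));
    // Write the result back so the loop isn't optimized away.
    out[threadIdx.x + blockIdx.x * blockDim.x] = d0 + d1 + d2 + d3;
}

int main() {
    float *out;
    cudaMalloc(&out, 1024 * sizeof(float));
    fp8_e4m3_mma_kernel<<<1024 / 256, 256>>>(out, 1 << 20);
    cudaDeviceSynchronize();
    cudaFree(out);
}
```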