Releases: ashvardanian/less_slow.cpp
Float FMA vs Integer DP4A & DPX Instructions ☣️
CUDA natively supports Fused-Multiply-Accumulate operations for every float type, including `f16` and `bf16`. It also provides DP4A instructions for 8-bit integer dot-products with 32-bit accumulators and `umul24` instructions for 24-bit integer multiplication. Starting with Hopper, Dynamic Programming eXtensions (DPX) were added for combinatorial problems; they can be used to implement Algebraic Graph Theory algorithms as matrix multiplications over alternative semi-rings.
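For context, here is a minimal CUDA C++ sketch of what these instruction families look like in device code. The kernel and buffer names are made up for illustration; only the `__fmaf_rn`, `__hfma2`, `__dp4a`, and `__umul24` intrinsics are standard (DP4A needs `sm_61` or newer):

```cuda
#include <cuda_fp16.h> // `__half2` and `__hfma2`

// Hypothetical kernel: each thread exercises one element per instruction family.
__global__ void fma_dp4a_umul24_sketch(                          //
    float const *a, float const *b, float *acc_f32,              // f32 FMA
    __half2 const *ha, __half2 const *hb, __half2 *acc_f16,      // packed f16 FMA
    int const *pa, int const *pb, int *acc_i32,                  // 4x i8 dot-products
    unsigned const *ua, unsigned const *ub, unsigned *acc_u32) { // 24-bit multiplies

    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;

    // Fused multiply-add: `a * b + c` with a single rounding step.
    acc_f32[i] = __fmaf_rn(a[i], b[i], acc_f32[i]);

    // Same idea for two `f16` values packed into a `__half2`.
    acc_f16[i] = __hfma2(ha[i], hb[i], acc_f16[i]);

    // DP4A: dot-product of four packed 8-bit integers, accumulated in 32 bits.
    acc_i32[i] = __dp4a(pa[i], pb[i], acc_i32[i]);

    // 24-bit multiply: only the low 24 bits of each operand participate.
    acc_u32[i] += __umul24(ua[i], ub[i]);
}
```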
How do those instructions stack up, and how much performance can we expect from recent State-of-the-Art GPUs like the Nvidia H200?
- `f64` FMA: 4.5 T
- `i64` FMA: 3.1 T
- `f32` FMA: 22 T
- `i32` FMA: 15.5 T ...so we should always prefer 32-bit ops
- `u8u32` DP4A: 39.3 T
- `u24u32` UMUL: 13.4 T ...not really better than `i32` FMA
- `f16` FMA on Volta: 12.2 T
- `bf16` FMA on Ampere: 12.2 T
- DPX for Floyd-Warshall algorithm with `u16` and `u32` on Hopper: 11 T
- DPX for Needleman-Wunsch algorithm with `i16` and `i32` on Hopper: 11 T
- DPX for Smith-Waterman algorithm with `i32` on Hopper: 27 T (see the sketch below)
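To make the DPX entries above more concrete, here is a rough sketch of the per-cell recurrences those kernels revolve around, assuming CUDA 12+; the helper names are hypothetical, but `__viaddmin_u32` and `__vimax3_s32_relu` are real DPX intrinsics (hardware-accelerated on Hopper, emulated on older GPUs):

```cuda
// Floyd-Warshall relaxation over the (min, +) semi-ring:
// dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j]).
__device__ unsigned relax_min_plus(unsigned d_ij, unsigned d_ik, unsigned d_kj) {
    // DPX fuses the add and the min into a single instruction on Hopper.
    return __viaddmin_u32(d_ik, d_kj, d_ij);
}

// Smith-Waterman cell update over a (max, +) semi-ring clamped at zero:
// H[i][j] = max(0, H[i-1][j-1] + score, E, F).
__device__ int smith_waterman_cell(int h_diag, int score, int e, int f) {
    // The `_relu` variants clamp the result at zero, matching local alignment.
    return __vimax3_s32_relu(h_diag + score, e, f);
}
```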
Check the code and inline comments for more details!
Those goodies are now part of the "StringZilla 4 CUDA" release 🥳
Minor
- Add: `dp4a` & `umul24` instructions (ce1e3b7)
- Add: DPX instructions on Hopper (1ab4f41)
- Add: In-register FMA benchmarks for GPUs (97991fd)