Releases: ashvardanian/less_slow.cpp

v0.10.8: macOS compilation fixes 🤗 🍏

22 Apr 11:25
  • Docs: OpenBLAS installation on macOS (be4a0be)
  • Fix: Missing const qualifiers in strided_ptr (9120723)
  • Fix: Can't std::format(time) on macOS (4d00aba)

Thanks to @ab-10 for spotting 🤗

Release v0.10.7

20 Apr 19:30

Patch

  • Improve: Include Asm tests into macOS Arm builds (#45) (ecff6e3)

v0.10.6: Fixing aligned allocations

20 Apr 11:04

Thanks to @bmanga 🤗

Release v0.10.5

19 Apr 08:36

Patch

Release v0.10.4

18 Apr 22:18

Patch

  • Improve: Detecting CUDA availability (21dfdf3)

Release v0.10.3

18 Apr 22:13

Patch

  • Docs: Cleaner stance on std::sin approximation (d4cbe85)

v0.10.2: Fast Math Patches

18 Apr 21:44
  • Improve: Horner method (cab8824), see the sketch below
  • Make: Default to -O2 (56016d5)
  • Fix: Compiling w/out Intel TBB (2346e03)
  • Docs: Typo (#39) (99a91ba)
  • Improve: Stricter range limits & fast-math (7ae2c01)
  • Make: Formatting CMake (0e3c916)
  • Improve: Detecting CUDA availability (91c5f4e)

Thanks to @corneliusroemer, @dzaima, @DrChr 🤗
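
For context, Horner's method rewrites a polynomial c₀ + c₁x + … + cₙxⁿ as c₀ + x·(c₁ + x·(c₂ + …)), so it needs just one multiply-add per coefficient instead of forming the powers of x separately. That's the trick behind cheap polynomial `std::sin`-style approximations. A minimal sketch, using an illustrative truncated Taylor series rather than the repo's tuned coefficients:

```cpp
#include <array>
#include <cstddef>

// Horner's method: p(x) = c0 + x·(c1 + x·(c2 + ...)) evaluates a polynomial
// with one multiply-add per coefficient, never forming x^k explicitly.
template <typename scalar_type_, std::size_t count_>
constexpr scalar_type_ horner(std::array<scalar_type_, count_> const &coefficients, scalar_type_ x) noexcept {
    scalar_type_ result = coefficients[count_ - 1];
    for (std::size_t i = count_ - 1; i != 0; --i)
        result = result * x + coefficients[i - 1]; // one FMA per step on modern hardware
    return result;
}

// Example: sin(x) ≈ x·(1 - x²/6 + x⁴/120 - x⁶/5040), a truncated Taylor
// series in x²; the coefficients are illustrative, not the repo's tuned ones.
constexpr float sine_approximation(float x) noexcept {
    float const x2 = x * x;
    return x * horner<float, 4>({1.0f, -1.0f / 6.0f, 1.0f / 120.0f, -1.0f / 5040.0f}, x2);
}
```

With fast-math or explicit FMA intrinsics, each loop step compiles down to a single fused multiply-add, so a polynomial with n coefficients costs about n FMAs.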

Release v0.10.1

09 Apr 06:16

Patch

  • Fix: Destroy CUDA events (c50e2e7)

v0.10: cuBLASLt examples for `fp8_e4m3` GEMM

27 Feb 12:56

DeepSeek has just released their mixed-precision FP8 GEMM implementation, and it felt like a good time to introduce some cuBLASLt snippets as a baseline for such work. On Nvidia H200, the results for different input sizes look like this:

--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
cublaslt_tops<fp8_e4m3_t, float>/256         12496 ns        12496 ns        56284 TOP=2.67999T/s
cublaslt_tops<fp8_e4m3_t, float>/512         13089 ns        13089 ns        53100 TOP=20.4883T/s
cublaslt_tops<fp8_e4m3_t, float>/1024        14882 ns        14882 ns        46918 TOP=144.23T/s
cublaslt_tops<fp8_e4m3_t, float>/2048        25802 ns        25802 ns        26869 TOP=665.679T/s
cublaslt_tops<fp8_e4m3_t, float>/4096       109316 ns       109313 ns         6021 TOP=1.25715P/s
cublaslt_tops<fp8_e4m3_t, float>/8192       821080 ns       821050 ns          629 TOP=1.33907P/s
cublaslt_tops<fp8_e4m3_t, float>/16384     7135472 ns      7135461 ns           93 TOP=1.23269P/s
cublaslt_tops<fp8_e4m3_t, float>_BigO         0.00 N^3        0.00 N^3  
cublaslt_tops<fp8_e4m3_t, float>_RMS             2 %             2 % 

The advertised dense FP8 throughput for H100 and H200 in the SXM form factor is about 2 Peta-Ops, and cuBLASLt achieves around 67% of that in the benchmarks above, peaking at ~1.34 Peta-Ops near n = 8192. So one should definitely be able to squeeze out more.
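
For the curious, the TOP counter is just the textbook op count of a dense GEMM: an n×n×n multiplication performs 2·n³ scalar operations, one multiply and one add per MAC. At n = 8192 that's 2 · 8192³ ≈ 1.1 · 10¹² ops in ~821 µs, which works out to the ~1.34 Peta-Ops above. A hedged Google Benchmark sketch of how such a rate counter can be registered; the GEMM launch itself is elided, and this is not the repo's exact code:

```cpp
#include <cstddef>
#include <cstdint>
#include <benchmark/benchmark.h>

// An n×n×n GEMM performs 2·n³ scalar ops (one multiply + one add per MAC).
// Reported as a rate, that becomes the `TOP` counter in the table above.
static void gemm_tops(benchmark::State &state) {
    std::size_t const n = static_cast<std::size_t>(state.range(0));
    for (auto _ : state) {
        // ... enqueue the cuBLASLt matmul and synchronize the stream ...
    }
    state.SetComplexityN(static_cast<std::int64_t>(n));
    state.counters["TOP"] = benchmark::Counter(
        2.0 * n * n * n * state.iterations(), // total ops across all iterations
        benchmark::Counter::kIsRate);         // divided by total elapsed seconds
}
BENCHMARK(gemm_tops)->RangeMultiplier(2)->Range(256, 16384)->Complexity(benchmark::oNCubed);
```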

I haven't tried implementing synthetic ALU benchmarks for the different FP8-oriented PTX instructions, so if you have time and want to try something new, feel free to submit a PR 🤗
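
If you want a starting point, the core cuBLASLt call sequence for an `fp8_e4m3` GEMM is short. Below is a minimal sketch, assuming square n×n column-major operands, FP32 accumulation, and pre-allocated device buffers; error checking, per-tensor scales, and workspace tuning are omitted, and the actual benchmark differs in details:

```cpp
#include <cstddef>
#include <cublasLt.h>
#include <cuda_runtime.h>

// Minimal fp8_e4m3 GEMM through cuBLASLt: D = Aᵀ·B, accumulating in FP32.
// FP8 kernels require the "TN" layout, i.e. A transposed and B non-transposed.
// `a` and `b` point to device buffers of `__nv_fp8_e4m3`, `c` to `float`.
void fp8_gemm(void const *a, void const *b, float *c, std::size_t n, cudaStream_t stream) {
    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    // Operation descriptor: FP32 compute and scaling types, A transposed.
    cublasLtMatmulDesc_t operation;
    cublasLtMatmulDescCreate(&operation, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasOperation_t const transposed = CUBLAS_OP_T;
    cublasLtMatmulDescSetAttribute(operation, CUBLASLT_MATMUL_DESC_TRANSA, &transposed, sizeof(transposed));

    // Operand layouts: square, column-major, leading dimension `n`.
    cublasLtMatrixLayout_t a_layout, b_layout, c_layout;
    cublasLtMatrixLayoutCreate(&a_layout, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&b_layout, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&c_layout, CUDA_R_32F, n, n, n);

    // With a NULL `algo`, cuBLASLt picks an applicable algorithm internally.
    float const alpha = 1.0f, beta = 0.0f;
    cublasLtMatmul(handle, operation, &alpha, a, a_layout, b, b_layout, &beta,
                   c, c_layout, c, c_layout,
                   /*algo=*/nullptr, /*workspace=*/nullptr, /*workspace_size=*/0, stream);

    cublasLtMatrixLayoutDestroy(c_layout);
    cublasLtMatrixLayoutDestroy(b_layout);
    cublasLtMatrixLayoutDestroy(a_layout);
    cublasLtMatmulDescDestroy(operation);
    cublasLtDestroy(handle);
}
```

A tuned path would also attach per-tensor scales via `CUBLASLT_MATMUL_DESC_A_SCALE_POINTER` and friends, and query `cublasLtMatmulAlgoGetHeuristic` with a real workspace instead of relying on the default selection.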

Release v0.9.2

23 Feb 13:36

Patch

  • Docs: Counting PTX as Assembly lines (cb470dd)