This repo contains CUDA kernels that accelerate ComPosit quantized arithmetic, plus code to perform QAT (quantization-aware training) on PyTorch models.
The kernels use tensor cores and half-precision arithmetic, so results and performance may be non-deterministic.
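As background, QAT simulates low-precision arithmetic during training via a quantize-dequantize ("fake quantization") round trip. A minimal pure-Python sketch of that idea, using a plain uniform int8 scheme as a stand-in (the actual ComPosit format is not shown here):

```python
def fake_quantize(x: float, scale: float, qmin: int = -128, qmax: int = 127) -> float:
    """Quantize-dequantize round trip, the core op simulated during QAT.

    `scale` maps real values onto an integer grid; qmin/qmax clamp to the
    representable range. This uniform int8 scheme is a stand-in -- the
    real ComPosit format differs.
    """
    q = round(x / scale)          # quantize onto the integer grid
    q = max(qmin, min(qmax, q))   # clamp to the representable range
    return q * scale              # dequantize back to a real value

# On-grid values survive the round trip; off-grid values snap to the
# nearest grid point; out-of-range values saturate at the clamp bounds.
print(fake_quantize(0.5, 0.01))    # survives
print(fake_quantize(0.503, 0.01))  # snaps to nearest grid point
print(fake_quantize(10.0, 0.01))   # saturates at qmax * scale
```

During QAT, applying this round trip in the forward pass (with a straight-through gradient in the backward pass) lets the model adapt to the quantization error it will see at inference time.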
Read these:
- https://huggingface.co/blog/kernel-builder
- https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md
To make changes, you need Nix. Do the multi-user installation: https://nixos.org/download/#nix-install-linux Then follow the 'getting started' section of https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md, including the cachix step:
```shell
# Use cachix without installing it
nix run nixpkgs#cachix -- use huggingface

# Enter the development shell
nix develop .#devShells.torch28-cxx11-cu128-x86_64-linux

# Generate the Torch build files
build2cmake generate-torch build.toml

# Create a virtualenv and install the kernel into it
python -m venv .venv
source .venv/bin/activate
pip install --no-build-isolation -e .
```

This installs the kernel as a package into the Python in `.venv`. Run your programs inside the shell that spawns when you run `nix develop` above. If torch doesn't seem to work, apply the first solution at https://danieldk.eu/Nix-CUDA-on-non-NixOS-systems#make-runopengl-driverlib-and-symlink-the-driver-library
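Once `pip install -e .` finishes, a quick way to confirm the package is visible to the venv's Python is an import check. A minimal sketch; `composit_kernels` is a hypothetical placeholder name, substitute whatever package name the repo's build actually registers:

```python
import importlib.util

def is_installed(pkg_name: str) -> bool:
    """Return True if `pkg_name` can be imported from this Python."""
    return importlib.util.find_spec(pkg_name) is not None

# "composit_kernels" is a hypothetical placeholder; use the name that
# `pip install --no-build-isolation -e .` actually installed.
if is_installed("composit_kernels"):
    print("kernel package found; run your programs inside the nix develop shell")
else:
    print("kernel package not found; check that .venv is active and re-run pip install")
```

Run this with `.venv` activated, inside the `nix develop` shell, so the CUDA libraries are on the search path.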
Also see https://github.com/zeroby0/PyComposit, which is a simpler version that tries to do everything in Python