Non-ggml backend #31
This has been a topic of some discussion in #4 and on the Discord, so I figured I'd document our initial findings so far.
We would like to switch away from ggml at some point so that we can remove the C compiler dependency, and enable running on other types of devices (namely the GPU).
Our primary candidate for a Rust-native ML/tensor backend is `burn`, which is a flexible deep learning framework that supports multiple backends (including `ndarray` and `torch`).
Unfortunately, it doesn't support the two formats we need: `f16` (original weights) and `q4_0`/`q4_1` (quantized weights). Adding these to the `ndarray` backend should be viable, but getting it right and working optimally (i.e. similar to ggml's optimisations for those datatypes) will take some time.
Torch does support `f16`, but only on the GPU, and `burn`'s `torch` backend supports it. The main problem there is actually just testing: the 7B weights are 14GB in `f16` (7B parameters × 2 bytes), which is difficult to fit on most consumer GPUs.
So we're in a bit of a pickle - there are three options available, all of which will require some work, and all of which have individual drawbacks:
- Quantize the model to standard `uint8` and use the `ndarray`/`torch` backends. This is the least work (at least in theory), but `uint8` quantization performs worse than either `f16` or `q4`, from what I've heard.
- Add support for `f16` to `burn`'s `ndarray` backend. The `torch` backend should already work, but it will be very hard to test with most of our machines. Adding support to `ndarray` for CPU inference shouldn't be impossible either (especially if we just convert to `f32` for every operation; there's a rough sketch of that after this list), but it will be difficult to make it performance-optimal.
- Add support for `q4_0`/`q4_1` to `burn`'s `ndarray` backend. This is the option that will give us the most parity with the current implementation (assuming the majority of our users are using `q4` weights), but it has the same performance-optimality issue as `f16` on the CPU (every cumulative operation, like matrix multiplication and such, will need to be specialised). Additionally, there is no way to natively store a 4-bit element, so there's no guarantee that this will be space-optimal (e.g. we can't assume that `ndarray` and `rustc` will remap `[[bool; 4]; N]` to `[u8; N/2]`); see the packing sketch after this list.
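To make the "convert to `f32` for every operation" idea concrete, here's a minimal sketch using the `half` crate; the function name and shapes are just illustrative, not an existing API:

```rust
use half::f16;
use ndarray::Array2;

/// Widen a row-major `f16` weight buffer into an `f32` ndarray so that the
/// existing `f32` kernels can be reused unchanged. Simple, but it doubles the
/// in-memory working set and gives up any hope of being performance-optimal.
fn widen_to_f32(weights: &[f16], rows: usize, cols: usize) -> Array2<f32> {
    let data: Vec<f32> = weights.iter().map(|w| w.to_f32()).collect();
    Array2::from_shape_vec((rows, cols), data).expect("shape mismatch")
}
```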
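And for `q4_0`, a sketch of what the block format could look like if we mirror ggml's current layout (32 weights per block, one `f32` scale, two 4-bit values packed per byte); the type and function names here are made up for illustration:

```rust
const QK: usize = 32;

/// One q4_0-style block: a per-block scale plus 32 weights stored as 4-bit
/// nibbles, two per byte (mirroring ggml's layout at the time of writing).
#[repr(C)]
struct BlockQ40 {
    d: f32,            // per-block scale
    qs: [u8; QK / 2],  // 32 weights packed as nibbles
}

/// Dequantize a single block: each nibble is offset by 8 and scaled by `d`.
fn dequantize_block(block: &BlockQ40, out: &mut [f32; QK]) {
    for (i, &byte) in block.qs.iter().enumerate() {
        let lo = (byte & 0x0F) as i8 - 8;
        let hi = (byte >> 4) as i8 - 8;
        out[2 * i] = f32::from(lo) * block.d;
        out[2 * i + 1] = f32::from(hi) * block.d;
    }
}
```

Getting that to be fast is the hard part: the nibbles have to be unpacked inside every matmul kernel (ideally with SIMD, as ggml does), rather than dequantizing whole tensors up front.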
This is summarised in the following table:
| | `uint8` | `f16` | `q4` |
|---|---|---|---|
| `ndarray` | Yes, but at noticeable quality loss | Requires semi-significant implementation work | Requires significant implementation work |
| `torch` | Yes, but at noticeable quality loss (GPU, CPU) | Yes, but is GPU-only | Unknown - should be possible, but likely requires custom code |
An idea that I briefly floated was porting ggml itself to Rust using `c2rust` and some cleanup work, but that's likely to be quite time-consuming, and it locks us out of the relatively free improvements we get from people making PRs against llama.cpp's ggml implementation. The gain from having pure Rust would be outweighed by the maintenance burden we'd put on ourselves.
I believe the other Rust ML crates also do not support `f16` or `q4`, but that's from a cursory exploration. Happy to be proven wrong!