Non-ggml backend #31
This has been a topic of some discussion in #4 and on the Discord, so I figured I'd document our initial findings so far.
We would like to switch away from ggml at some point so that we can remove the C compiler dependency, and enable running on other types of devices (namely the GPU).
Our primary candidate for a Rust-native ML/tensor backend is `burn`, which is a flexible deep learning framework that supports multiple backends (including `ndarray` and `torch`).
Unfortunately, it doesn't support the two formats we need: `f16` (original weights) and `q4_0`/`q4_1` (quantized weights). Adding these to the `ndarray` backend should be viable, but getting it right and working optimally (i.e. similar to ggml's optimisations for those datatypes) will take some time.
Torch does support `f16`, but only on the GPU, and `burn`'s `torch` backend supports it. The main problem there is actually just testing: the 7B weights are 14GB in `f16` (7B parameters × 2 bytes), which is difficult to fit on most consumer GPUs.
So we're in a bit of a pickle - there are three options available, all of which will require some work, and all of which have individual drawbacks:
- Quantize the model to standard `uint8` and use the `ndarray`/`torch` backends. This is the least work (at least in theory), but `uint8` quantization performs worse than either `f16` or `q4`, from what I've heard.
- Add support for `f16` to `burn`'s `ndarray` backend. The `torch` backend should already work, but it will be very hard to test with most of our machines. Adding support to `ndarray` for CPU inference shouldn't be impossible either (especially if we just convert to `f32` for every operation; there's a rough sketch of that after this list), but it will be difficult to make it performance-optimal.
- Add support for `q4_0`/`q4_1` to `burn`'s `ndarray` backend. This is the option that will give us the most parity with the current implementation (assuming the majority of our users are using `q4` weights), but it has the same performance-optimality issue as `f16` on the CPU (every cumulative operation, like matrix multiplication and such, will need to be specialised). Additionally, there is no way to natively store a 4-bit element, so there's no guarantee that this will be space-optimal (e.g. we can't assume that `ndarray` and `rustc` will remap `[[bool; 4]; N]` to `[u8; N/2]`); see the packing sketch after this list.
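To make the "convert to `f32` for every operation" idea concrete, here's a minimal sketch using the `half` crate; the function name and shapes are just illustrative, not an existing API:

```rust
use half::f16;
use ndarray::Array2;

/// Widen a row-major `f16` weight buffer into an `f32` ndarray so that the
/// existing `f32` kernels can be reused unchanged. Simple, but it doubles the
/// in-memory working set and gives up any hope of being performance-optimal.
fn widen_to_f32(weights: &[f16], rows: usize, cols: usize) -> Array2<f32> {
    let data: Vec<f32> = weights.iter().map(|w| w.to_f32()).collect();
    Array2::from_shape_vec((rows, cols), data).expect("shape mismatch")
}
```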
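And for `q4_0`, a sketch of what the block format could look like if we mirror ggml's current layout (32 weights per block, one `f32` scale, two 4-bit values packed per byte); the type and function names here are made up for illustration:

```rust
const QK: usize = 32;

/// One q4_0-style block: a per-block scale plus 32 weights stored as 4-bit
/// nibbles, two per byte (mirroring ggml's layout at the time of writing).
#[repr(C)]
struct BlockQ40 {
    d: f32,            // per-block scale
    qs: [u8; QK / 2],  // 32 weights packed as nibbles
}

/// Dequantize a single block: each nibble is offset by 8 and scaled by `d`.
fn dequantize_block(block: &BlockQ40, out: &mut [f32; QK]) {
    for (i, &byte) in block.qs.iter().enumerate() {
        let lo = (byte & 0x0F) as i8 - 8;
        let hi = (byte >> 4) as i8 - 8;
        out[2 * i] = f32::from(lo) * block.d;
        out[2 * i + 1] = f32::from(hi) * block.d;
    }
}
```

Getting that to be fast is the hard part: the nibbles have to be unpacked inside every matmul kernel (ideally with SIMD, as ggml does), rather than dequantizing whole tensors up front.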
This is summarised in the following table:
| | `uint8` | `f16` | `q4` |
|---|---|---|---|
| `ndarray` | Yes, but at noticeable quality loss | Requires semi-significant implementation work | Requires significant implementation work |
| `torch` | Yes, but at noticeable quality loss (GPU, CPU) | Yes, but is GPU-only | Unknown - should be possible, but likely requires custom code |
An idea that I briefly floated was porting ggml itself to Rust using `c2rust` and some cleanup work, but that's likely to be quite time-consuming, and it locks us out of the relatively free improvements we get from people making PRs against llama.cpp's ggml implementation. The gain from having pure Rust would be outweighed by the maintenance burden we'd put on ourselves.
I believe the other Rust ML crates also do not support `f16` or `q4`, but that's from a cursory exploration. Happy to be proven wrong!