This repository was archived by the owner on Jun 24, 2024. It is now read-only.

Non-ggml backend #31

@philpax


This has been a topic of some discussion in #4 and on the Discord, so I figured I'd document our initial findings so far.

We would like to switch away from ggml at some point so that we can remove the C compiler dependency, and enable running on other types of devices (namely the GPU).

Our primary candidate for a Rust-native ML/tensor backend is burn, which is a flexible deep learning framework that supports multiple backends (including ndarray and torch).

Unfortunately, it doesn't support the two formats we need: f16 (original weights) and q4_0/q4_1 (quantized weights). Adding these to the ndarray backend should be viable, but getting it right and working optimally (i.e. similar to ggml's optimisations for those datatypes) will take some time.
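
For context, q4_0 and q4_1 are block formats rather than element types: ggml groups 32 weights per block with a shared scale (and, for q4_1, a shared minimum). Here's a rough Rust translation of those layouts, mirroring ggml's C structs at the time of writing - treat it as an illustration, not a drop-in definition:

```rust
/// Sketch of ggml's 4-bit block layouts, translated to Rust for illustration.
const QK: usize = 32; // weights per block, as in ggml

/// q4_0: each weight is reconstructed as `d * (q - 8)`.
#[repr(C)]
struct BlockQ4_0 {
    d: f32,           // shared scale ("delta") for the block
    qs: [u8; QK / 2], // 32 x 4-bit quantised values, packed two per byte
}

/// q4_1: each weight is reconstructed as `d * q + m`.
#[repr(C)]
struct BlockQ4_1 {
    d: f32,           // shared scale
    m: f32,           // shared minimum
    qs: [u8; QK / 2], // 32 x 4-bit quantised values, packed two per byte
}
```

Neither of these maps onto an element type that ndarray (or burn's ndarray backend) knows about, which is why supporting them means writing custom kernels rather than just plugging in a new scalar type.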

Torch does support f16, but only on the GPU, and burn's Torch backend supports it. The main problem there is actually just testing: the 7B weights are 14GB in f16 (7 billion parameters × 2 bytes each), which is more VRAM than most consumer GPUs have.

So we're in a bit of a pickle - there are three options available, all of which will require some work, and all of which have individual drawbacks:

  1. Quantize the model to standard uint8 and use the ndarray/torch backends. This is the least work (at least in theory), but from what I've heard, uint8 quantization loses more quality than either f16 or q4.
  2. Add support for f16 to burn's ndarray backend. The torch backend should already work, but it will be very hard to test with most of our machines. Adding support to ndarray for CPU inference shouldn't be impossible either (especially if we just convert to f32 for every operation), but it will be difficult to make it performance-optimal.
  3. Add support for q4_0/q4_1 to burn's ndarray backend. This is the option that will give us the most parity with the current implementation (assuming the majority of our users are using q4 weights), but it has the same performance-optimality issue as f16 on the CPU (every cumulative operation, like matrix multiplication and such, will need to be specialised). Additionally, there is no way to natively store a 4-bit element, so there's no guarantee that this will be space-optimal (e.g. we can't assume that ndarray and rustc will remap [[bool; 4]; N] to [u8; N/2]). See the sketch after this list for what the packing involves.
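
To make the packing and specialisation concerns in option 3 concrete, here's a minimal standalone sketch (plain Rust, no burn/ndarray types; the function names are made up for illustration) of packing two 4-bit values into one byte and dequantising a q4_0-style block back to f32:

```rust
/// Pack 4-bit values (each in the low nibble of a u8) two to a byte.
/// Hypothetical helper, for illustration only.
fn pack_nibbles(vals: &[u8]) -> Vec<u8> {
    assert!(vals.len() % 2 == 0, "need an even number of nibbles");
    vals.chunks_exact(2)
        .map(|pair| (pair[0] & 0x0F) | ((pair[1] & 0x0F) << 4))
        .collect()
}

/// Dequantise one q4_0-style block: 32 packed nibbles plus a shared scale.
/// Assumes ggml's convention of storing values biased by 8 (weight = d * (q - 8)).
fn dequantize_q4_0_block(d: f32, qs: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, &byte) in qs.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[2 * i] = d * lo as f32;
        out[2 * i + 1] = d * hi as f32;
    }
    out
}
```

A fast implementation wouldn't materialise the f32 block like this - it would run its hot loops (dot products and the like) directly over the packed bytes, as ggml's SIMD kernels do, which is where the real optimisation effort goes.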

The three options are summarised in the following table:

| Backend | uint8 | f16 | q4 |
| --- | --- | --- | --- |
| ndarray | Yes, but at noticeable quality loss | Requires semi-significant implementation work | Requires significant implementation work |
| torch | Yes, but at noticeable quality loss (GPU, CPU) | Yes, but is GPU-only | Unknown - should be possible, but likely requires custom code |

An idea that I briefly floated was porting ggml itself to Rust using c2rust and some cleanup work, but that's likely to be quite time-consuming and it locks us out of the relatively-free improvements we get from people making PRs against llama.cpp's ggml implementation. The gain from having pure Rust would be outweighed by the maintenance burden we'd put on ourselves.


I believe the other Rust ML crates also lack f16 and q4 support, but that's based on only a cursory exploration. Happy to be proven wrong!

Labels

issue:enhancement (New feature or request), meta:help-wanted (Extra attention is needed), topic:backend-support (Support for alternate non-GGML backends, or for particular GGML backend features)
