Replies: 3 comments
-
@MichaelClifford and I have been working on this subject for about a week, and it intersects here: in particular, we're pursuing a common training stack (discussed in opt. 2). Maintaining separate local training paths (opt. 1) will be super-linearly harder than maintaining a single path. The evident optimal candidate so far for this common stack is PyTorch: it's very well maintained, it's the best-supported flavor of the HF wrappers (AutoModelForCausalLM, etc.), and it has strong backend support (MPS since roughly mid-2022, CUDA for eons, obviously CPU, plus more esoteric stuff like FLAX, I guess).

The hitch so far has been the limited set of convenient 4-bit quantization implementations on the MPS path for Torch. BitsAndBytes is the CUDA-optimized quantization library that we were using in the early Colab/Kaggle training notebooks, but it's not available on Mac. There are various implementations of 8-bit quantization across MLX, HF, and Torch, however. MLX supports 4-bit in a cheeky way, and we could write a Torch kernel to implement this behavior, but that would be a lot of work if 8-bit is even ballpark in performance and fits in most VRAM.

Another hiccup is MPS profiling for Torch workflows. Our existing training flow is in MLX; the tradeoff for all the format interconversions we do is that MLX math is ALWAYS done on the GPU, so we don't have to worry about Torch kernels swapping between CPU and GPU devices the way we might on Mac. By adopting Torch, we'd have to profile the extent to which the MPS backend is actually being used. @MichaelClifford and I each independently found that fp16 inference on low-ish spec MacBook Pros (M1 Pro, 32GB RAM in my case) was abysmal (my confident guess is swap-paging latency). We also know that 8-bit inference via llama.cpp is pretty good, so if we didn't have to pay the memory-paging overhead we might be in better shape.

Current idea: we're thinking that we could ship an 8-bit quantized .safetensors model flavor from HF that is intended as the base format for training + serving. To validate this idea on current hardware, the strategy is to 8-bit quantize a snapshot of merlinite-7b on a CUDA machine, move it over to a Mac, and try qLoRA under the regime described above.
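For concreteness, here's a minimal sketch of that CUDA-side step. The repo id and output path are placeholders, and it assumes a transformers/bitsandbytes pairing recent enough to serialize 8-bit checkpoints:

```python
# Sketch only: 8-bit quantize a model snapshot on a CUDA box via BitsAndBytes,
# then save it as .safetensors to carry over to the Mac for the qLoRA experiment.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ibm/merlinite-7b"  # placeholder HF repo id
out_dir = "merlinite-7b-8bit"  # placeholder output path

bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # BitsAndBytes is CUDA-only today

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available CUDA device(s)
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Serialize the quantized weights as .safetensors (the proposed base format).
model.save_pretrained(out_dir, safe_serialization=True)
tokenizer.save_pretrained(out_dir)
```

The Mac side of the experiment would then load this checkpoint and attach LoRA adapters via peft, which is exactly the piece that's still gated on quantization support for MPS.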
-
FWIW, I was able to get the BitsAndBytes fork https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6 working with an AMD Radeon 7900 XT, PyTorch 2.2.1+rocm5.7, and Fedora 39. On the plus side, 4-bit quantization reduced VRAM consumption from about 17 GiB to about 9.6 GiB. On the flip side, it also slowed training down by about 30%, from 6 to 6.5 it/s to about 4 to 4.5 it/s. A co-worker got similar results on their ROCm setup. I conclude that BitsAndBytes is a useful option for people with less GPU memory: the 7900 XT and 7900 XTX are top-of-the-line consumer cards with 20 / 24 GiB of memory, and most AMD Radeon consumer cards have less. I neither know how BitsAndBytes influences performance on CUDA nor whether there are better quantization settings for ROCm. I'm using …
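As a hedged sketch, a typical 4-bit setup through the transformers BitsAndBytes integration (the same API the ROCm fork exposes) looks roughly like this; the model id is a placeholder and these are common QLoRA-style defaults, not necessarily the exact settings behind the numbers above:

```python
# Sketch of a typical 4-bit load via the transformers BitsAndBytes integration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # QLoRA's NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16, weights stay 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder 7B model
    quantization_config=bnb_config,
    device_map="auto",
)
```

The nf4 + double-quant combination is the QLoRA default; bitsandbytes also accepts "fp4" for bnb_4bit_quant_type, which may trade off accuracy and speed differently on ROCm.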
-
Update on this work: we've found the following.
MLX 4-bit quantization algorithm (as we understand it):

- Divide a 2D parameter layer into 64-parameter batches; an example batch would be the first 64 parameters in the first row of the matrix.
- For each batch, compute a scale and a bias.
- Quantize each weight in the batch against that batch's scale and bias. This gives the parameter a 4-bit representation that is quantized w.r.t. the other 63 parameters in the batch.
- Each quantized parameter occupies 4 of the 32 bits of a uint32, so eight 4-bit parameters can be stored in a single 32-bit uint32 "cell". (A rough sketch of this scheme is at the end of this comment.)

At inference time, the weights are dequantized using the scale and bias vectors that are saved after quantization. The dequantized matrix is used in higher precision and then discarded. Here's the C++ implementation of the algorithm: link

The problem:

The algorithm that MLX implements does not consider outlier parameters. If an outlier is present in a batch, it'll "flatten" the other parameters; and if the outlier is instead clipped to a lower value to avoid flattening, the literature suggests that the model's performance degrades dramatically. In the QLoRA paper, the authors implement mixed-precision inference, keeping outliers at BF16 (or similar) and quantizing the non-outliers together. Therefore, using MLX's implementation as an example for a custom PyTorch layer is likely not optimal.

Waiting for BitsAndBytes to support MPS:

This is currently the "ideal" path for us, because it would unify inference to be MPS-backend-accelerated in the same way that the original fine-tuning notebook is CUDA-backend-accelerated, with identical PyTorch and HF APIs. The CLI could then identify which backend is available and quietly use it, instead of branching down an alternative path like we're doing now. (A tiny backend-detection sketch also appears at the end of this comment.)

The alternative to waiting:

We could translate the BNB custom layers and CUDA kernels to Metal (MPS), but the maintainers have said they're already working on this, so we might not want to invest in that.

Further considerations
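To make the scheme above concrete, here's an illustrative NumPy sketch of group-wise 4-bit affine quantization as described (64-element groups, per-group scale and bias, eight codes packed per uint32). It mirrors our reading of the algorithm, not MLX's actual Metal/C++ kernel:

```python
# Illustrative sketch of the group-wise 4-bit scheme described above; this is
# our reading of the algorithm, not MLX's actual kernel.
import numpy as np

GROUP = 64    # parameters per quantization batch
LEVELS = 15   # 2**4 - 1 representable steps per group

def quantize_row(w: np.ndarray):
    """Quantize one row (length must be a multiple of 64) into packed uint32 cells."""
    groups = w.reshape(-1, GROUP)
    bias = groups.min(axis=1, keepdims=True)                      # per-group bias
    scale = (groups.max(axis=1, keepdims=True) - bias) / LEVELS   # per-group scale
    scale = np.where(scale == 0, 1.0, scale)                      # guard constant groups
    q = np.clip(np.round((groups - bias) / scale), 0, LEVELS).astype(np.int64)

    # Pack eight 4-bit codes into each 32-bit "cell".
    packed = np.zeros((q.shape[0], GROUP // 8), dtype=np.uint32)
    for i in range(8):
        packed |= (q[:, i::8] << (4 * i)).astype(np.uint32)
    return packed, scale.squeeze(1), bias.squeeze(1)

def dequantize_row(packed: np.ndarray, scale: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Recover an approximate higher-precision row (used and then discarded)."""
    q = np.empty((packed.shape[0], GROUP), dtype=np.float32)
    for i in range(8):
        q[:, i::8] = (packed >> (4 * i)) & 0xF
    return (q * scale[:, None] + bias[:, None]).reshape(-1)

# The outlier problem in miniature: one large weight inflates its group's scale,
# so the remaining 63 weights collapse onto a handful of coarse levels.
row = np.random.randn(4096).astype(np.float32)
row[0] = 100.0  # planted outlier in the first group
approx = dequantize_row(*quantize_row(row))
err = np.abs(row - approx)
print("outlier group max error:", err[1:64].max(),
      "| clean group max error:", err[64:128].max())
```

Running it shows the reconstruction error blowing up in the group containing the planted outlier, which is exactly the flattening described above.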
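And the backend-detection idea mentioned above, as a tiny sketch (the function name is ours; nothing in the existing CLI is implied):

```python
# Sketch of "detect the backend and quietly use it" with plain PyTorch.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():           # NVIDIA (or ROCm builds of torch)
        return torch.device("cuda")
    if torch.backends.mps.is_available():   # Apple Silicon via Metal
        return torch.device("mps")
    return torch.device("cpu")              # fallback
```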
-
(mostly for long-term discussion/plan)
What does multi-platform, multi-backend support look like for the local development experience?
More ideas are welcome.
cc @matthicksj