Replies: 3 comments
-
@MichaelClifford and I have been working on this subject for about a week, and it intersects here: in particular, we're pursuing a common training stack (discussed in opt. 2). Maintaining separate local training paths (opt. 1) will be super-linearly harder than maintaining a single path. The evident optimal candidate so far for this common stack is PyTorch: it's very well maintained, it's the best-supported flavor of the HF wrappers (AutoModelForCausalLM, etc.), and it has strong backend support (MPS since roughly mid-2022, CUDA for eons, obviously CPU, plus more esoteric stuff like FLAX, I guess).

The hitch so far has been the limited set of convenient 4-bit quantization implementations on the MPS path for Torch. BitsAndBytes is the CUDA-optimized quantization library that we were using in the early Colab/Kaggle training notebooks, but it's not available on Mac. There are various implementations of 8-bit quantization across MLX, HF, and Torch, however. MLX supports 4-bit in a cheeky way, and we could write a Torch kernel to implement this behavior, but that would be a lot of work if 8-bit is even ballpark in performance and fits in most VRAM.

Another hiccup is MPS profiling for Torch workflows. Our existing training flow is in MLX; the tradeoff for all the format interconversions we do is that MLX math is ALWAYS done on the GPU, so we don't have to worry about Torch kernels swapping between CPU and GPU devices the way we might on Mac. By adopting Torch, we'd have to profile the extent to which the MPS backend is actually being used. @MichaelClifford and I each independently found that fp16 inference on low-ish spec MacBook Pros (M1 Pro, 32GB RAM in my case) was abysmal (my confident guess is swap-paging latency). We also know that 8-bit inference via llama.cpp is pretty good, so if we didn't have to pay the memory-paging overhead we might be in better shape.

Current idea: we're thinking that we could ship an 8-bit quantized .safetensors model flavor from HF that is intended as the base format for training + serving. To validate this idea on current hardware, the strategy is to 8-bit quantize a snapshot of merlinite-7b on a CUDA machine, move it over to a Mac, and try qLoRA under the regime described above.
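For concreteness, here's a minimal sketch of that CUDA-side step. The repo id and output path are placeholders, and it assumes a transformers/bitsandbytes pairing recent enough to serialize 8-bit checkpoints:

```python
# Sketch only: 8-bit quantize a model snapshot on a CUDA box via BitsAndBytes,
# then save it as .safetensors to carry over to the Mac for the qLoRA experiment.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ibm/merlinite-7b"  # placeholder HF repo id
out_dir = "merlinite-7b-8bit"  # placeholder output path

bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # BitsAndBytes is CUDA-only today

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available CUDA device(s)
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Serialize the quantized weights as .safetensors (the proposed base format).
model.save_pretrained(out_dir, safe_serialization=True)
tokenizer.save_pretrained(out_dir)
```

The Mac side of the experiment would then load this checkpoint and attach LoRA adapters via peft, which is exactly the piece that's still gated on quantization support for MPS.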
-
FWIW, I was able to get the BitsAndBytes fork https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6 working with an AMD Radeon 7900 XT, PyTorch 2.2.1+rocm5.7, and Fedora 39. On the plus side, 4-bit quantization reduced VRAM consumption from about 17 GiB to about 9.6 GiB. On the flip side, it also slowed training down by about 30%, from 6 to 6.5 it/s to about 4 to 4.5 it/s. A co-worker got similar results on their ROCm setup. I conclude that BitsAndBytes is a useful option for people with less GPU memory: the 7900 XT and 7900 XTX are top-of-the-line consumer cards with 20 / 24 GiB of memory, and most AMD Radeon consumer cards have less. I neither know how BitsAndBytes influences performance on CUDA nor whether there are better quantization settings for ROCm. I'm using …
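As a hedged sketch, a typical 4-bit setup through the transformers BitsAndBytes integration (the same API the ROCm fork exposes) looks roughly like this; the model id is a placeholder and these are common QLoRA-style defaults, not necessarily the exact settings behind the numbers above:

```python
# Sketch of a typical 4-bit load via the transformers BitsAndBytes integration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # QLoRA's NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16, weights stay 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder 7B model
    quantization_config=bnb_config,
    device_map="auto",
)
```

The nf4 + double-quant combination is the QLoRA default; bitsandbytes also accepts "fp4" for bnb_4bit_quant_type, which may trade off accuracy and speed differently on ROCm.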
-
Update on this work: we've found the following.
MLX 4-bit quantization algorithm (as we understand it):

- Divide a 2D parameter layer into 64-parameter batches; an example batch would be the first 64 parameters in the first row of the matrix.
- For each batch, compute a scale and a bias.
- Quantize each weight in the batch against that batch's scale and bias. This gives the parameter a 4-bit representation that is quantized w.r.t. the other 63 parameters in the batch.
- Each quantized parameter occupies 4 of the 32 bits of a uint32, so eight 4-bit parameters can be stored in a single 32-bit uint32 "cell". (A rough sketch of this scheme is at the end of this comment.)

At inference time, the weights are dequantized using the scale and bias vectors that are saved after quantization. The dequantized matrix is used in higher precision and then discarded. Here's the C++ implementation of the algorithm: link

The problem:

The algorithm that MLX implements does not consider outlier parameters. If an outlier is present in a batch, it'll "flatten" the other parameters; and if the outlier is instead clipped to a lower value to avoid flattening, the literature suggests that the model's performance degrades dramatically. In the QLoRA paper, the authors implement mixed-precision inference, keeping outliers at BF16 (or similar) and quantizing the non-outliers together. Therefore, using MLX's implementation as an example for a custom PyTorch layer is likely not optimal.

Waiting for BitsAndBytes to support MPS:

This is currently the "ideal" path for us, because it would unify inference to be MPS-backend-accelerated in the same way that the original fine-tuning notebook is CUDA-backend-accelerated, with identical PyTorch and HF APIs. The CLI could then identify which backend is available and quietly use it, instead of branching down an alternative path like we're doing now. (A tiny backend-detection sketch also appears at the end of this comment.)

The alternative to waiting:

We could translate the BNB custom layers and CUDA kernels to Metal (MPS), but the maintainers have said they're already working on this, so we might not want to invest in that.

Further considerations
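To make the scheme above concrete, here's an illustrative NumPy sketch of group-wise 4-bit affine quantization as described (64-element groups, per-group scale and bias, eight codes packed per uint32). It mirrors our reading of the algorithm, not MLX's actual Metal/C++ kernel:

```python
# Illustrative sketch of the group-wise 4-bit scheme described above; this is
# our reading of the algorithm, not MLX's actual kernel.
import numpy as np

GROUP = 64    # parameters per quantization batch
LEVELS = 15   # 2**4 - 1 representable steps per group

def quantize_row(w: np.ndarray):
    """Quantize one row (length must be a multiple of 64) into packed uint32 cells."""
    groups = w.reshape(-1, GROUP)
    bias = groups.min(axis=1, keepdims=True)                      # per-group bias
    scale = (groups.max(axis=1, keepdims=True) - bias) / LEVELS   # per-group scale
    scale = np.where(scale == 0, 1.0, scale)                      # guard constant groups
    q = np.clip(np.round((groups - bias) / scale), 0, LEVELS).astype(np.int64)

    # Pack eight 4-bit codes into each 32-bit "cell".
    packed = np.zeros((q.shape[0], GROUP // 8), dtype=np.uint32)
    for i in range(8):
        packed |= (q[:, i::8] << (4 * i)).astype(np.uint32)
    return packed, scale.squeeze(1), bias.squeeze(1)

def dequantize_row(packed: np.ndarray, scale: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Recover an approximate higher-precision row (used and then discarded)."""
    q = np.empty((packed.shape[0], GROUP), dtype=np.float32)
    for i in range(8):
        q[:, i::8] = (packed >> (4 * i)) & 0xF
    return (q * scale[:, None] + bias[:, None]).reshape(-1)

# The outlier problem in miniature: one large weight inflates its group's scale,
# so the remaining 63 weights collapse onto a handful of coarse levels.
row = np.random.randn(4096).astype(np.float32)
row[0] = 100.0  # planted outlier in the first group
approx = dequantize_row(*quantize_row(row))
err = np.abs(row - approx)
print("outlier group max error:", err[1:64].max(),
      "| clean group max error:", err[64:128].max())
```

Running it shows the reconstruction error blowing up in the group containing the planted outlier, which is exactly the flattening described above.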
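And the backend-detection idea mentioned above, as a tiny sketch (the function name is ours; nothing in the existing CLI is implied):

```python
# Sketch of "detect the backend and quietly use it" with plain PyTorch.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():           # NVIDIA (or ROCm builds of torch)
        return torch.device("cuda")
    if torch.backends.mps.is_available():   # Apple Silicon via Metal
        return torch.device("mps")
    return torch.device("cpu")              # fallback
```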
-
(mostly for long-term discussion/plan)
What does multi-platform, multi-backend support look like for the local development experience?
More ideas are welcome.
cc @matthicksj