Closed
Feature Description
Support loading and running models quantized with the GPTQ/EXL2 formats in llama.cpp.
Motivation
These appear to be fast and useful quantisation methods:
- https://towardsdatascience.com/exllamav2-the-fastest-library-to-run-llms-32aeda294d26
- https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34
- https://huggingface.co/blog/gptq-integration
- https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/ (a detailed comparison between GPTQ, AWQ, EXL2, q4_K_M, q4_K_S, and load_in_4bit: perplexity, VRAM, speed, model size, and loading time)
Possible Implementation
N/A
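Although the issue leaves the implementation open, the core idea these formats share is storing weights as low-bit integers with one scale per group. A minimal sketch of that per-group 4-bit round-trip, assuming a simple symmetric scheme (helper names are hypothetical, and real GPTQ/EXL2 additionally use error compensation and mixed bit widths; this is not llama.cpp or ExLlama API):

```python
def quantize_4bit_groupwise(weights, group_size=128):
    """Quantize floats to signed 4-bit ints [-8, 7] with one scale per group.

    Illustrative only: GPTQ/EXL2 store per-group scales like this, but their
    actual algorithms also compensate quantization error across columns.
    """
    assert len(weights) % group_size == 0
    q, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Symmetric scale: map the group's max magnitude onto 7.
        scale = max(abs(x) for x in group) / 7.0 or 1.0
        scales.append(scale)
        q.append([max(-8, min(7, round(x / scale))) for x in group])
    return q, scales

def dequantize_4bit_groupwise(q, scales):
    """Recover approximate floats from 4-bit values and per-group scales."""
    out = []
    for group, scale in zip(q, scales):
        out.extend(v * scale for v in group)
    return out

# Round-trip on synthetic weights in [-1, 1): per-element error is
# bounded by half a quantization step (scale / 2).
weights = [0.01 * ((i * 37) % 200 - 100) for i in range(256)]
q, scales = quantize_4bit_groupwise(weights)
recovered = dequantize_4bit_groupwise(q, scales)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
```

The per-group scale keeps the format compact (4 bits per weight plus one float per 128 weights) while bounding the reconstruction error by half a step of the group's scale, which is why these formats can stay close to FP16 perplexity.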