GGUF quantization is an umbrella term for an LLM quantization ecosystem that includes:
- GGML (tensor library for machine learning);
- llama.cpp (LLM inference engine mostly targeting CPUs);
- GGUF (binary file format for storing quantized models).
GGUF quantization implements Post-Training Quantization (PTQ): given an already-trained Llama-like model in high precision, it reduces the bit width of each individual weight. The resulting checkpoint requires less memory and thus facilitates inference on consumer-grade hardware.
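To make the idea concrete, here is a minimal sketch of the simplest flavor of PTQ: round-to-nearest block quantization, loosely modeled on llama.cpp's `Q8_0` type (one shared scale per block of 32 int8 weights). The function names and layout here are illustrative assumptions, not the actual llama.cpp/ggml API; the real kernels are written in C with a different memory layout and many optimizations.

```python
import numpy as np

def quantize_q8_block(weights: np.ndarray, block_size: int = 32):
    """Toy round-to-nearest 8-bit quantization with one scale per block.

    Illustrative only: real GGUF quant types (Q8_0, Q4_K, ...) use
    different storage layouts and optimized C kernels inside ggml.
    """
    assert weights.size % block_size == 0
    blocks = weights.reshape(-1, block_size).astype(np.float32)
    # One scale per block: map the largest magnitude onto the int8 range.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8_block(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate fp32 weights from int8 values and block scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).ravel()

# Each weight now costs 8 bits plus a shared fp16 scale per 32 weights
# (~8.5 bits effective), versus 16 or 32 bits in the original checkpoint.
w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q8_block(w)
w_hat = dequantize_q8_block(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

The more aggressive GGUF types push below 8 bits by shrinking the block payload and spending the savings on extra per-block metadata; the scale-per-block structure above is the common skeleton.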
GGUF was inspired by previous PTQ methods, including GPTQ, AWQ, QLoRA and QuIP#. But unlike most prior work that came out of research labs, the GGUF ecosystem was developed by the prolific open-source contributor Georgi Gerganov and a few others.
Writing docs and papers is simply not their priority; see this comment:
As the ecosystem has rapidly grown, people have become confused about the various algorithm iterations and quantization settings.
This repository serves as unofficial documentation for the GGUF quantization ecosystem.
It is written mostly by hand, by a human. Any sections written by AI will be clearly flagged as such.
Contributions are more than welcome! If you find mistakes or omissions, feel free to submit a pull request.
Just a few simple rules:
- Reliable references: PRs should be supported by reliable references (e.g. code and author comments from the official llama.cpp repository). Medium articles and Reddit threads don't qualify.
- No AI slop: We all know when something is written by AI. Please only contribute when you have a human urge for expression.