Extreme LLM Compression via Additive Quantization

Vage Egiazarian* 1 2, Andrei Panferov* 1 2, Denis Kuznedelev 2 3, Elias Frantar 4, Artem Babenko 2, Dan Alistarh 4 5

arXiv:2401.06118v2 [cs.LG] 6 Feb 2024

Abstract: The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices.

[Figure 1: Perplexity on WikiText2 — AQLM (2-bit) vs. the FP16 baseline.]
Contribution. In this work, we improve the state-of-the-art in LLM compression by showing for the first time that Multi-Codebook Quantization (MCQ) techniques can be extended to LLM weight compression. Broadly, MCQ is a family of information retrieval methods (Chen et al., 2010; Jegou et al., 2010; Ge et al., 2013; Zhang et al., 2014; Babenko & Lempitsky, 2014; Martinez et al., 2016; 2018), consisting of specialized quantization algorithms to compress databases of vectors, allowing for efficient search. Unlike direct quantization, MCQ compresses multiple values jointly, by leveraging the mutual information of quantized values.

More precisely, we extend Additive Quantization (AQ) (Babenko & Lempitsky, 2014; Martinez et al., 2016), a popular MCQ algorithm, to the task of compressing LLM weights such that the output of each layer and Transformer block is approximately preserved. Our extension reformulates the classic AQ optimization problem to reduce the error in LLM layer outputs under the input token distribution, as well as to jointly optimize codes over layer blocks, rather than only preserving the weights themselves as in standard AQ. We refer to the resulting procedure as Additive Quantization of Language Models (AQLM). Unlike some extreme LLM quantization approaches that require hybrid sparse-quantized formats which separate outlier quantization (Kim et al., 2023; Dettmers et al., 2023b), AQLM quantizes models in a simple homogeneous format, which is easy to support in practice. Our main contributions are as follows:

1. We propose the AQLM algorithm, which extends AQ to post-training compression of LLM weights, via two innovations: (1) adapting the MAP-MRF optimization problem behind AQ to be instance-aware, taking layer calibration input & output activations into account; (2) complementing the layer-wise optimization with an efficient intra-layer tuning technique, which optimizes quantization parameters jointly over several layers, using only the calibration data.

2. We evaluate the effectiveness of this algorithm on the task of compressing accurate open LLMs from the LLAMA 2 (Touvron et al., 2023) family with compression rates of 2-4 bits per parameter. We find that AQLM outperforms the previous state-of-the-art across the standard 2-4 bit compression range, with the most significant improvements for extreme 2-bit quantization (see Figure 1). We provide detailed ablations for the impact of various algorithm parameters, such as code width and number of codebooks, and extend our analysis to the recent Mixtral model (Jiang et al., 2024).

3. We show that AQLM is practical, by providing efficient GPU and CPU kernel implementations for specific encodings, as well as end-to-end generation¹. Results show that our approach can match or even outperform the floating point baseline in terms of speed, while reducing the memory footprint by up to 8x. Specifically, AQLM can be executed with layer-wise speedups of ∼30% for GPUs, and of up to 4x for CPU inference.

¹https://github.com/vahe1994/AQLM

2. Background & Related Work

2.1. LLM Quantization

Early efforts towards post-training quantization (PTQ) methods (Nagel et al., 2020; Gholami et al., 2021) that scale to LLMs, such as ZeroQuant (Yao et al., 2022), LLM.int8() (Dettmers et al., 2022), and nuQmm (Park et al., 2022), employed direct round-to-nearest (RTN) projections, and adjusted quantization granularity to balance memory efficiency and accuracy. GPTQ (Frantar et al., 2022a) proposed a more accurate data-aware approach via an approximate large-scale solver for minimizing layer-wise ℓ2 errors.

Dettmers & Zettlemoyer (2022) examined the accuracy-compression trade-offs of these early methods, suggesting that 4-bit quantization may be optimal for RTN quantization, and observing that data-aware methods like GPTQ allow for higher compression, i.e. strictly below 4 bits/weight, while maintaining Pareto optimality. Our work brings this Pareto frontier below 3 bits/weight for the first time. Parallel work quantizing both weights and activations to 8 bits, by Dettmers et al. (2022), Xiao et al. (2022), and Yao et al. (2022), noted that the “outlier features” in large LLMs cause substantial errors, prompting various mitigation strategies.

Recently, several improved techniques have focused on the difficulty of quantizing weight outliers, which have a high impact on the output error. SpQR (Dettmers et al., 2023b) addresses this by saving outliers as a highly-sparse higher-precision matrix. AWQ (Lin et al., 2023) reduces the error of quantizing channels with the highest activation magnitudes by employing per-channel scaling to reduce the error on important weights. SqueezeLLM (Kim et al., 2023) uses the diagonal Fisher as a proxy for the Hessian and implements non-uniform quantization through K-means clustering.

The state-of-the-art method in terms of accuracy-to-size trade-off is QuIP (Chee et al., 2023). Concurrently to our work, an improved variant called QuIP# (Tseng et al., 2023) was introduced. Roughly, these methods work by first “smoothening” weights by multiplying with a rotation matrix, and then mapping them onto a lattice. QuIP was the first method to obtain stable results (i.e., single-digit PPL increases) in the 2-bit per parameter compression range. At a high level, QuIP and QuIP# aim to minimize the “worst-case” error for each layer, given initial weights and calibration data. For instance, in QuIP#, the distribution of the rotated weights approximates a Gaussian, while the encoding lattice (E8P) is chosen to minimize “rounding” error.
By contrast, our approach uses a different weight encoding (codebooks are additive), and learned codebooks instead of a fixed codebook. Thus, our insight is that we should be able to obtain higher accuracy by direct optimization of the codebooks over the calibration set, removing the rotation. Further, we show that codebooks for different layers can co-train via joint fine-tuning over the calibration data.

2.2. Quantization for Nearest Neighbor Search

Our work builds on approximate nearest neighbor search (ANN) algorithms. Unlike PTQ, ANN quantization aims to compress a database of vectors to allow a user to efficiently compute similarities and find nearest neighbors relative to a set of query points. For high compression, modern ANN search algorithms employ vector quantization (VQ)—which quantizes multiple vector dimensions jointly (Burton et al., 1983; Gray, 1984). It achieves this by learning “codebooks”: i.e. a set of learnable candidate vectors that can be used to encode the data. To encode a given database vector, VQ splits it into sub-groups of entries, then encodes every group by choosing a vector from the learned codebook. The algorithm efficiently computes distances or dot-products for similarity search by leveraging the linearity of dot products.

Quantization methods for ANN search generalize vector quantization and are referred to as multi-codebook quantization (MCQ). MCQ methods typically do not involve information loss on the query side, which makes them the leading approach for memory-efficient ANN (Ozan et al., 2016; Martinez et al., 2018). We briefly review MCQ below.

Product quantization (PQ) (Jegou et al., 2010) is an early version of MCQ, which encodes each vector x ∈ R^D as a concatenation of M codewords from M (D/M)-dimensional codebooks C_1, ..., C_M, each containing K codewords. PQ decomposes a vector into M separate subvectors and applies vector quantization (VQ) to each subvector, while using a separate codebook. Thus, each vector x is encoded by a tuple of codeword indices [i_1, ..., i_M] and approximated by x ≈ [c_{1i_1}, ..., c_{Mi_M}]. Fast Euclidean distance computation becomes possible using lookup tables:

  ||q − x||²₂ ≈ ||q − [c_{1i_1}, ..., c_{Mi_M}]||²₂ = Σ_{m=1}^{M} ||q_m − c_{m i_m}||²,

where q_m is the m-th subvector of a query q. This sum can be calculated using M additions and lookups if the distances from query subvectors to codewords are precomputed. Since product-based approximations work better if the (D/M)-dimensional components have independent distributions, subsequent work has looked into finding better transformations (Ge et al., 2013; Norouzi & Fleet, 2013). As for other similarity functions, Guo et al. (2016) propose a quantization procedure for maximum inner product search (MIPS). They minimize quantization error in the inner products between database and query vectors by solving a constrained optimization problem. Similarly to the formula above, this procedure allows for efficient inner product search by pre-computing dot products between the query q and all codes in the learned codebooks, then adding these partial dot products to recover the full similarity score.

Non-orthogonal quantizations. Follow-up work (Chen et al., 2010; Babenko & Lempitsky, 2014; Martinez et al., 2016; Zhang et al., 2014; Ozan et al., 2016; Martinez et al., 2018) generalized the idea of Product Quantization by approximating each vector by a sum of M codewords instead of a concatenation. The resulting procedure is still efficient while the approximation accuracy is increased. For this, Residual Vector Quantization (Chen et al., 2010) quantizes the original vectors, and then iteratively quantizes the approximation residuals from the previous iteration. Additive Quantization (AQ) (Babenko & Lempitsky, 2014) is more general, as it does not impose constraints on the codewords from the different codebooks. Usually, AQ provides the smallest compression errors, but is more complex to train for large M. We discuss this in detail in Section 3.

Finally, several recent works (Martinez et al., 2016; 2018; Zhang et al., 2014) elaborate on the idea of Additive Quantization, proposing more effective procedures for codebook learning. Composite Quantization (CQ) (Zhang et al., 2014) learns codebooks with a fixed value of the scalar product between codewords from different codebooks. Currently, the state-of-the-art compression accuracy is achieved by the LSQ method (Martinez et al., 2018).

Vector quantization for model compression. There has been significant work on exploiting vector quantization in the context of machine learning. For instance, Zhou et al. (2017); Li et al. (2017); Chen et al. (2019) use multi-codebook quantization to compress word embeddings within deep learning models. Another line of work (Blalock & Guttag, 2021; McCarter & Dronen, 2022; Fernández-Marqués et al., 2023) explores vector quantization for linear models, or linear layers within deep models. Similarly to PQ above, these techniques pre-compute inner products between inputs and all codes, then compute the linear layer via look-up, which speeds up inference. However, these algorithms introduce significant prediction error that does not allow them to compress deep models. Thus, we believe we are the first to successfully adapt and scale MCQ to LLMs.
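To make the lookup-table trick described above concrete, here is a minimal PyTorch sketch (ours, not from the paper): a vector is approximated by a sum of M codewords, and its dot product with a query reduces to M table lookups once per-codebook inner products are precomputed. The sizes D, M, K and the random codebooks are illustrative placeholders.

```python
# Minimal sketch of multi-codebook (additive) quantization and lookup-table dot products.
import torch

torch.manual_seed(0)
D, M, K = 64, 4, 256                          # vector dim, codebooks, codewords per codebook (B = 8)
codebooks = torch.randn(M, K, D)              # learned offline in a real MCQ method
codes = torch.randint(0, K, (M,))             # one B-bit index per codebook for this vector

x_hat = codebooks[torch.arange(M), codes].sum(dim=0)   # additive reconstruction of the vector

query = torch.randn(D)
lut = codebooks @ query                       # precompute <query, codeword> for all M*K codewords
fast_dot = lut[torch.arange(M), codes].sum()  # M lookups + additions per encoded vector

assert torch.allclose(fast_dot, query @ x_hat, atol=1e-5)
```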
3. AQLM: Additive Quantization for LLMs

3.1. Overview

We start from the observation that additive quantization (AQ) solves a related problem to post-training quantization (PTQ) (Nagel et al., 2020; Frantar et al., 2022b): both settings assume the existence of a set of “input” vectors, i.e. input data for AQ, and the weight matrix rows for PTQ. The goal is to compress these inputs while preserving dot product similarity, against query vectors (for AQ), and against layer input embeddings (for PTQ). The difference between the two is that AQ assumes that the distribution of queries is unknown, whereas PTQ methods, e.g. (Frantar et al., 2022b), show that it is sufficient to optimize for sample input embeddings from a set of calibration data.

At a high level, we start by solving the following problem: for a linear layer with d_in input and d_out output features, given its weights W ∈ R^{d_out × d_in} and a set of calibration inputs X ∈ R^{d_in × n}, one seeks a configuration of quantized weights Ŵ that minimizes the squared error between the outputs of the original and compressed layer:

  arg min_{Ŵ} ||WX − ŴX||²₂.   (1)

In the following, we will assume that Ŵ is quantized using AQ, and adopt standard notation (Martinez et al., 2016). AQ splits weight rows into groups of g consecutive elements, and represents each group of weights as a sum of M vectors chosen from multiple learned codebooks C_1, ..., C_M, each containing 2^B vectors (for B-bit codes). A weight group is encoded by choosing a single code from each codebook and summing them up. We denote this choice as a one-hot vector b_m, which results in the following representation for a group: Σ_{m=1}^{M} C_m b_{ijm}. This is similar to PTQ algorithms (Frantar et al., 2022a), except for using much more complex coding per group. To represent the full weights, we simply concatenate:

  Ŵ_i = Σ_{m=1}^{M} C_m b_{i,1,m} ⊕ ... ⊕ Σ_{m=1}^{M} C_m b_{i,d_in/g,m},   (2)

where ⊕ denotes concatenation and b_{ijm} ∈ R^{2^B} represents a one-hot code for the i-th output unit, j-th group of input dimensions and m-th codebook. This representation encodes a group of g weights using M · B bits and further requires g · 2^B · 16 bits for FP16 codebooks. The resulting objective is

  arg min_{C,b} ||WX − Concat_{i,j}(Σ_{m=1}^{M} C_m b_{i,j,m}) X||²₂.   (3)

To learn this weight representation, we initialize codebooks C and codes b by running residual K-means as in Chen et al. (2010). Then, we alternate between updating codes b_{i,j,m} and codebooks C_m until the loss function (3) stops improving up to the specified tolerance. Since codes are discrete and codebooks are continuous, and we are optimizing over multiple interacting layers, our approach has three phases, described in Algorithm 1 and detailed below.

3.2. Phase 1: Beam search for codes

First, AQLM updates the codes b_{i,j,m} to minimize the MSE objective (3). Similarly to Babenko & Lempitsky (2014); Martinez et al. (2016; 2018), we reformulate the objective in terms of a fully-connected discrete Markov Random Field (MRF) to take advantage of MRF solvers.

To simplify the derivation, let us first consider a special case of a single output unit (d_out = 1) and a single quantization group (i.e. g = d_in), to get rid of the concatenation operator: ||WX − Σ_{m=1}^{M} C_m b_m X||²₂. We rewrite this objective by expanding the squared difference:

  ||WX − Σ_{m=1}^{M} C_m b_m X||²₂ = ||WX||²₂ − 2⟨WX, Σ_{m=1}^{M} C_m b_m X⟩_F + ||Σ_{m=1}^{M} C_m b_m X||²₂.   (4)

Above, ⟨·,·⟩_F denotes the Frobenius inner product of two matrices. Next, let us consider the three components of Eqn. (4) in isolation. First, note that ||WX||²₂ is constant in b and can be ignored. The second component can be expanded further into pairwise dot products:

  ||Σ_{m=1}^{M} C_m b_m X||²₂ = Σ_{i=1}^{M} Σ_{j=1}^{M} ⟨C_i b_i X, C_j b_j X⟩_F.   (5)

We denote this type of product as ⟨A, B⟩_{XXᵀ} := ⟨AXXᵀ, B⟩_F in future derivations.
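As a sanity check on the derivation above, the expansion in Eqns. (4)–(5) can be verified numerically for the single-output, single-group case; the snippet below is ours, not from the paper, and all sizes and random tensors are illustrative.

```python
# Numerically verify that the squared error decomposes into a constant, a linear term,
# and pairwise interactions that only involve the calibration data through XX^T.
import torch

torch.manual_seed(0)
d_in, n, M, K = 16, 32, 3, 8
W = torch.randn(1, d_in, dtype=torch.double)      # single output unit (d_out = 1)
X = torch.randn(d_in, n, dtype=torch.double)      # calibration inputs
C = torch.randn(M, K, d_in, dtype=torch.double)   # M codebooks with K codewords each
b = torch.randint(0, K, (M,))                     # one selected codeword per codebook
W_hat = C[torch.arange(M), b].sum(dim=0, keepdim=True)

direct = ((W - W_hat) @ X).pow(2).sum()

XXt = X @ X.T                                     # the <A, B>_{XX^T} product reuses this matrix
const = (W @ X).pow(2).sum()                      # ||WX||^2, constant in b
linear = -2 * ((W @ XXt) * W_hat).sum()           # -2 <WX, W_hat X>_F
pairwise = sum(((C[i, b[i]] @ XXt) * C[j, b[j]]).sum()
               for i in range(M) for j in range(M))   # Eqn. (5), written via XX^T

assert torch.allclose(direct, const + linear + pairwise)
```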
Figure 3: AQLM compressed weight format. Horizontal and vertical axes are input features and output units, respectively. Depth represents the codebook index. Reconstruction procedure, from left to right: i) compressed weight codes; ii) zoom-in on one weight group, where each code is an index in its respective codebook; iii) select codes from each codebook; iv) add up codes as in (2); v) multiply by scales (one scale per output dimension).
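The decoding path in Figure 3 / Eqn. (2) can be sketched in a few lines of PyTorch (ours, not from the paper); the function name `decode` and the toy sizes below are illustrative placeholders.

```python
# For each group of g weights: look up one codeword per codebook, sum them,
# concatenate the groups along the input dimension, and apply a per-row scale.
import torch

torch.manual_seed(0)
d_out, d_in, g, M, B = 4, 32, 8, 2, 8
K = 2 ** B
codebooks = torch.randn(M, K, g)                    # C_1..C_M, each with 2^B codewords of size g
codes = torch.randint(0, K, (d_out, d_in // g, M))  # b_{i,j,m}: one index per (row, group, codebook)
scales = torch.rand(d_out, 1)                       # one scale per output unit

def decode(codes, codebooks, scales):
    gathered = codebooks[torch.arange(M), codes]    # shape (d_out, d_in/g, M, g)
    groups = gathered.sum(dim=2)                    # add up the M codewords per group
    return groups.reshape(codes.shape[0], -1) * scales  # concatenate groups, then scale rows

W_hat = decode(codes, codebooks, scales)
print(W_hat.shape)   # torch.Size([4, 32])
```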
Algorithm 1 AQLM: Additive Quantization for LLMs
Require: model, data
1: X_block := model.input_embeddings(data)
2: for i = 1, ..., model.num_layers do
3:   block := model.get_block(i)
4:   Y_block := block(X_block)
5:   for layer ∈ linear_layers(block) do
6:     W := layer.weight
7:     X := layer_inputs(layer, X_block)
8:     C, b, s := initialize(W)   // k-means
9:     while loss improves by at least τ do
10:      C, s := train_Cs_adam(XXᵀ, W, C, b, s)
11:      b := beam_search(XXᵀ, W, C, b, s)
12:    end while
13:    /* save for fine-tuning */
14:    layer.weight := AQLMFormat(C, b, s)
15:  end for
16:  θ := trainable_parameters(block)
17:  while loss improves by at least τ do
18:    L := ||block(X_block) − Y_block||²₂
19:    θ := adam(θ, ∂L/∂θ)
20:  end while
21:  Y_block := block(X_block)
22: end for

attention, followed by a single MLP layer. Having quantized all linear layers within a single transformer block, we fine-tune its remaining parameters to better approximate the original outputs of that transformer block by backpropagating through the weight representation (2).

Concretely, we use the PyTorch autograd engine to differentiate ||block(X_block) − Y_block||², where X_block are the input activations for that transformer block and Y_block are the output activations of block(X_block) recorded prior to quantization. We train the codebooks C_m, scale vectors s and all non-quantized parameters (RMSNorm scales and biases), while keeping the codes b_{i,j,m} frozen. Similarly to Section 3.3, we train these parameters using Adam to minimize the MSE against the original block outputs (prior to quantization). This phase uses the same calibration data as for the individual layer quantization. The full procedure is summarized in Alg. 1.

While fine-tuning blocks is more expensive than individual linear layers, it is still possible to quantize billion-parameter models on a single GPU in reasonable time. Also, since the algorithm only modifies a few trainable parameters, it uses little VRAM for optimizer states. This fine-tuning converges after a few iterations, as it starts from a good initial guess. In practice, fine-tuning transformer layers takes a minority (10-30% or less) of the total calibration time.

4. Experiments

We evaluate the AQLM algorithm in typical scenarios for post-training quantization of modern LLMs. Our evaluation is focused on the LLAMA 2 model family since it is a popular backbone for fine-tuned models or general LLM applications, e.g. (Dettmers et al., 2023a), and we also present results on Mistral-family models (Jiang et al., 2024). In Section 4.1, we evaluate the full AQ procedure for various LLAMA 2 models and quantization bit-widths; Section 4.2 presents an ablation analysis for individual AQ components and implementation details.

4.1. Compression quality for modern LLMs

We report perplexity on the WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) validation sets. We also measure zero-shot accuracy on WinoGrande (Sakaguchi et al., 2021), PiQA (Tata & Patel, 2003), HellaSwag (Zellers et al., 2019), ARC-easy and ARC-challenge (Clark et al., 2018) via the LM Eval Harness (Gao et al., 2021). We broadly follow the evaluation setup of GPTQ (Frantar et al., 2022a).

We consider three main targets in terms of compression ranges: 2-2.8 bits, 3-3.1 bits, and 4-4.1 bits per model parameter. In the results below, the average bits per parameter takes into account only quantized weights; we do not include parameters kept in floating precision, similarly to related work. The details of the model size estimate are provided in Appendix G. We compare AQLM against GPTQ for 3 & 4 bits (Frantar et al., 2022a), SpQR for 3 & 4 bits (Dettmers et al., 2023b), QuIP for 2, 3 & 4 bits (Chee et al., 2023) and QuIP# for 2 & 4 bits (Tseng et al., 2023). While GPTQ and SpQR technically support 2-bit quantization, they perform poorly in the 2-3 bit range. For QuIP, we omit results for the 7B model, as we could not achieve competitive performance in this one scenario using the available implementations.

GLU activations, group query attention and QKA weight merging.
Table 1: Evaluation of quantized L LAMA 2 models for 2-2.8 bits per parameter, with an extra section for higher bitwidth.
We report perplexity on WikiText2 (Merity et al., 2016) & C4 (Raffel et al., 2020) and accuracy for zero-shot tasks. The
Average accuracy is the mean of 5 zero-shot tasks. Primary metrics are Wiki2 (PPL), C4 (PPL) and Average accuracy.
Size Method Avg bits Wiki2↓ C4↓ WinoGrande↑ PiQA↑ HellaSwag↑ ArcE↑ ArcC↑ Average accuracy↑
– 16 5.12 6.63 67.25 78.45 56.69 69.32 40.02 62.35
AQLM 2.02 6.64 8.56 64.17 73.56 49.49 61.87 33.28 56.47
7B QuIP# 2.02 8.22 11.01 62.43 71.38 42.94 55.56 28.84 52.23
AQLM 2.29 6.29 8.11 65.67 74.92 50.88 66.50 34.90 58.57
– 16 4.57 6.05 69.61 78.73 59.72 73.27 45.56 65.38
AQLM 1.97 5.65 7.51 65.43 76.22 53.74 69.78 37.8 60.59
QuIP 2.00 13.48 16.16 52.80 62.02 35.80 45.24 23.46 43.86
13B QuIP# 2.01 6.06 8.07 63.38 74.76 51.58 64.06 33.96 57.55
AQLM 2.18 5.41 7.20 68.43 76.22 54.68 69.15 39.42 61.58
AQLM 2.53 5.15 6.80 68.11 76.99 56.54 71.38 40.53 62.71
AQLM 2.76 4.94 6.54 68.98 77.58 57.71 72.90 43.60 64.15
– 16 3.12 4.97 76.95 81.07 63.99 77.74 51.11 70.17
AQLM 2.07 3.94 5.72 75.93 80.43 61.79 77.68 47.93 68.75
70B
QuIP 2.01 5.90 8.17 67.48 74.76 50.45 62.16 33.96 57.76
QuIP# 2.01 4.16 6.01 74.11 79.76 60.01 76.85 47.61 67.67
Table 2: Evaluation of quantized L LAMA 2 models for 3-3.1 bits per parameter, with the same metrics as in Table 1.
Size Method Avg bits Wiki2↓ C4↓ WinoGrande↑ PiQA↑ HellaSwag↑ ArcE↑ ArcC↑ Average accuracy↑
– 16 5.12 6.63 67.25 78.45 56.69 69.32 40.02 62.35
AQLM 3.04 5.46 7.08 66.93 76.88 54.12 68.06 38.40 60.88
7B GPTQ 3.00 8.06 10.61 59.19 71.49 45.21 58.46 31.06 53.08
SpQR 2.98 6.20 8.20 63.54 74.81 51.85 67.42 37.71 59.07
– 16 4.57 6.05 69.61 78.73 59.72 73.27 45.56 65.38
AQLM 3.03 4.82 6.37 68.43 77.26 58.30 70.88 42.58 64.49
13B GPTQ 3.00 5.85 7.86 63.93 76.50 53.47 65.66 38.48 59.61
SpQR 2.98 5.28 7.06 67.48 77.20 56.34 69.78 39.16 61.99
QuIP 3.00 5.12 6.79 69.93 76.88 57.07 70.41 41.47 63.15
– 16 3.12 4.97 76.95 81.07 63.99 77.74 51.11 70.17
AQLM 3.01 3.36 5.17 77.19 81.28 63.23 77.61 50.00 69.86
70B GPTQ 3.00 4.40 6.26 71.82 78.40 60.00 72.73 44.11 65.41
SpQR 2.98 3.85 5.63 74.66 80.52 61.95 75.93 48.04 68.22
QuIP 3.01 3.87 5.67 74.59 79.98 60.73 73.19 46.33 66.96
(Currently, there is no official implementation of the original QuIP (non-#) for the LLAMA 2 model.) For QuIP#, we focus on 2 and 4 bits because the available implementation does not yet support 3-bit compression. We calibrate each algorithm using a subset of the RedPajama dataset (Computer, 2023), with a sequence length of 4096.

The exact bit-widths for each method are dictated by parameters such as the number of codebooks and code width. We report results for the 2−2.8 and 3−3.1 bitwidth ranges in Tables 1 and 2, respectively. Additional results for 4−4.1 bits are deferred to Appendix E.2.

The results show that AQLM outperforms the previous best PTQ algorithms across all settings, often by wide margins, especially at high compression. This holds both in terms of PPL across standard validation sets (WikiText2 and C4), and accuracy across zero-shot tasks. Specifically, we observe the highest accuracy gains in the “extreme” 2-2.1 bits per parameter range, where the deviation from the uncompressed model becomes large for all methods.

Mixtral quantization. Table 3 presents results on the Mixtral MoE-type model, comparing against QuIP# at 2 bits. (See Appendix E.1 for full results.) AQLM outperforms QuIP# in this case as well. Although the margins are lower compared to LLAMA 2 models, they are still significant for “harder” tasks, such as Arc Challenge (+3 points).
Table 3: Evaluation of quantized Mixtral (Jiang et al., 2024) models for 2 bits. The table reports perplexity on Wiki-
Text2 (Merity et al., 2016) and C4 (Raffel et al., 2020), as well as accuracy for zero-shot tasks. The Average accuracy
column is the mean of 5 zero-shot task accuracies. Primary metrics are Wiki2 (PPL), C4 (PPL) and Average accuracy.
Size Method Avg bits Wiki2↓ C4↓ WinoGrande↑ PiQA↑ HellaSwag↑ ArcE↑ ArcC↑ Average accuracy↑
– 16 3.46 5.02 75.45 82.37 64.65 83.38 55.80 72.33
AQLM 1.98 4.61 5.75 73.64 79.27 57.91 78.96 48.63 67.68
8x7B
QuIP# 2.01 4.75 5.89 71.11 79.05 58.23 77.57 45.73 66.34
Pareto optimality of AQLM. The significant error improvements raise the question of choosing the “optimal” model variant to maximize accuracy within a certain memory budget. For this, we follow Dettmers & Zettlemoyer (2022): a quantized model is said to be Pareto-optimal if it maximizes accuracy at the same or lower total size (bytes). Despite rapid progress, prior art methods are not Pareto-optimal at 2 bits: for instance, the previous best 2-bit LLAMA 2 13B (QuIP#, Table 1) achieves Wiki2 PPL of 6.06, but one can get much lower 5.21 PPL by using a 7B model with 4-bit quantization, which is smaller (see Appendix Table 10).

AQLM compression to strictly 2 bits for the same model is also below Pareto-optimality, as it is outperformed by 4-bit AQLM compression of LLAMA 2 7B (5.21 vs 5.65). To find the Pareto-optimal quantization bitwidth, we run experiments between 2-3 bits per parameter and report them in Table 1, below the horizontal bars. Thus, the Pareto-optimal bitwidth for AQLM appears to be around 2.5 bits per parameter (Table 1), at which point we are comparable to 5-bit AQLM for LLAMA 2 7B (Appendix Table 10). In turn, the 2.76-bit AQLM on 13B outperforms the uncompressed 7B model. As such, AQLM is the first algorithm to achieve Pareto-optimality at less than 3 bits per parameter.

4.2. Ablation analysis

In Appendix D, we examine key design choices regarding initialization, alternating optimization, the impact of the fine-tuning protocol, the distribution of codebooks vs. groups, as well as other hyper-parameters. In brief, we first find that the residual K-means initialization is critical for fast algorithm convergence: when compared with random initialization, it needs significantly fewer training iterations. Second, to validate our calibration fine-tuning procedure, we compare it against 1) no fine-tuning, 2) fine-tuning only of non-linear layers (e.g. RMSNorm) but not of codebook parameters, and 3) fine-tuning only the codebooks (but not other layers). The results, presented in full in Appendix D, show that fine-tuning the codebook parameters has the highest impact on accuracy, by far, while fine-tuning the RMSNorm only has minor impact. This validates our choice of leveraging the calibration set for learned codebooks.

Further, we observe that increasing the number of sample sequences in the range 128 to 4096 leads to a gradual PPL improvement, but with diminishing returns. In this respect, AQLM benefits more from larger calibration sets (similarly to QuIP#), as opposed to direct methods like GPTQ which saturate accuracy at around 256 input sequences. Finally, we investigate various options for investing a given bit budget, comparing e.g. longer codes (e.g. 1x15) vs. multiple codebooks with shorter codes (e.g. 2x8).

4.3. Inference Speed

Although our primary objective is to maximize accuracy for a given model size, AQLM can also be practical in terms of inference latency. To demonstrate this, we implemented efficient GPU and CPU kernels for a few hardware-friendly configurations of AQLM. The results can be found in Table 4. For GPU inference, we targeted quantized LLAMA 2 models with 16-bit codebooks, corresponding to 2.07 bits for LLAMA 2 70B, 2.18 bits for 13B, and 2.29 bits for 7B models (see Table 1), as well as a 2x8-bit codebook model with perplexity 7.98 on Wiki2. For each model we benchmark the matrix-vector multiplication subroutine performance on a standard layer. The results show that AQLM can execute at speeds comparable to or better than FP16. End-to-end generation numbers on a preliminary HuggingFace integration can be found in Appendix H: for instance, we can achieve ∼6 tokens/s on LLAMA 2 70B in this setting. We observe that multiple smaller codebooks allow efficient GPU cache utilization, leading to greater speedup, at the price of slightly lower accuracy.

Table 4: Speed of the FP16 gate_proj layer matrix-vector multiplication in PyTorch, and relative AQLM speedups.

Llama 2                 7B        13B       70B
2-bit speedup over FP16 on Nvidia RTX 3090 GPU
Original (float16)      129 µs    190 µs    578 µs
AQLM (Table 1)          x1.31     x1.20     x1.20
AQLM (2×8-bit)          x1.57     x1.82     x3.05
2-bit speedup over FP32 on Intel i9 CPU, 8 cores
Original (float32)      1.83 ms   3.12 ms   11.31 ms
AQLM (2×8-bit)          x2.75     x3.54     x3.69
AQLM (4×8-bit)          x2.55     x3.02     x4.07
AQLM (8×8-bit)          x2.29     x2.68     x4.03
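For reference, the following is a hypothetical sketch (ours, not from the paper) of how one might time the FP16 matrix-vector baseline of Table 4 in PyTorch; the AQLM kernels themselves live in the repository referenced earlier. The sizes below assume the LLAMA 2 7B gate_proj shape (4096 → 11008).

```python
# Time a half-precision matrix-vector product on GPU using CUDA events.
import torch

d_in, d_out, n_iters = 4096, 11008, 1000
W = torch.randn(d_out, d_in, dtype=torch.float16, device="cuda")
x = torch.randn(d_in, 1, dtype=torch.float16, device="cuda")

for _ in range(10):                      # warm-up
    _ = W @ x
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(n_iters):
    _ = W @ x
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / n_iters * 1000:.1f} us per matvec")
```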
Next, we explore how to leverage AQLM to accelerate CPU inference. As discussed in Section 2.2, additive quantization can compute dot products efficiently if the codebook size is small. One way to achieve this for AQLM is to replace each 16-bit codebook with a number of smaller 8-bit ones. This leads to higher quantization error, but still outperforms the baselines in terms of accuracy (see Appendix Table 9). The results in Table 4 show that this also allows for up to 4x faster inference relative to FP32 on CPU.

5. Conclusion and Future Work

We presented AQLM, a new form of additive quantization (AQ) targeted to LLM compression, which significantly improved the state-of-the-art results for LLM quantization in the regime of 2 and 3 bits per weight. In terms of limitations, AQLM is more computationally expensive than direct post-training quantization methods, such as RTN or GPTQ, specifically because of the use of a more complex coding representation. Yet, despite the more sophisticated encoding and decoding, we have shown that AQLM lends itself to efficient implementation on both CPU and GPU. Overall, we find it remarkable that, using AQLM, massive LLMs can be executed accurately and efficiently using little memory.

6. Acknowledgements

The authors would like to thank Ruslan Svirschevski for his help in solving technical issues with AQLM and baselines. We also thank Tim Dettmers for helpful discussions on the structure of weights in modern LLMs and size-accuracy trade-offs. The authors would also like to thank Daniil Pavlov for his assistance with CPU benchmarking. Finally, the authors would like to thank the communities of ML enthusiasts known as LocalLLaMA³ and the Petals community on Discord⁴ for the crowd wisdom about running LLMs on consumer devices.

³https://www.reddit.com/r/LocalLLaMA/
⁴https://github.com/bigscience-workshop/petals/

References

Babenko, A. and Lempitsky, V. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 931–938, 2014.

Besag, J. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society Series B: Statistical Methodology, 48(3):259–279, 1986.

Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023.

Blalock, D. and Guttag, J. Multiplying matrices without multiplying. In International Conference on Machine Learning, pp. 992–1004. PMLR, 2021.

Burton, D., Shore, J., and Buck, J. A generalization of isolated word recognition using vector quantization. In ICASSP '83. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 8, pp. 1021–1024, 1983. doi: 10.1109/ICASSP.1983.1171915.

Chee, J., Cai, Y., Kuleshov, V., and Sa, C. D. QuIP: 2-bit quantization of large language models with guarantees, 2023.

Chen, S., Wang, W., and Pan, S. J. Deep neural network quantization via layer-wise optimization using limited training data. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3329–3336, Jul. 2019. doi: 10.1609/aaai.v33i01.33013329. URL https://ojs.aaai.org/index.php/AAAI/article/view/4206.

Chen, Y., Guan, T., and Wang, C. Approximate nearest neighbor search by residual vector quantization. Sensors, 10(12):11259–11273, 2010.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

Computer, T. RedPajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.

Dettmers, T. and Zettlemoyer, L. The case for 4-bit precision: k-bit inference scaling laws. arXiv preprint arXiv:2212.09720, 2022.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022.

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023a.

Dettmers, T., Svirschevski, R., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. arXiv preprint arXiv:2306.03078, 2023b.
Fernández-Marqués, J., AbouElhamayed, A. F., Lane, N. D., and Abdelfattah, M. S. Are we there yet? Product quantization and its hardware acceleration. ArXiv, abs/2305.18334, 2023. URL https://api.semanticscholar.org/CorpusID:258967539.

Frantar, E. and Alistarh, D. QMoE: Practical sub-1-bit compression of trillion-parameter models. arXiv preprint arXiv:2310.16795, 2023.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022a.

Frantar, E., Singh, S. P., and Alistarh, D. Optimal Brain Compression: A framework for accurate post-training quantization and pruning. arXiv preprint arXiv:2208.11580, 2022b. Accepted to NeurIPS 2022, to appear.

Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N., Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628.

Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):744–755, 2013.

Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and Keutzer, K. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630, 2021.

Gray, R. Vector quantization. IEEE ASSP Magazine, 1(2):4–29, 1984. doi: 10.1109/MASSP.1984.1162229.

Guo, R., Kumar, S., Choromanski, K., and Simcha, D. Quantization based fast inner product search. In Artificial Intelligence and Statistics, pp. 482–490. PMLR, 2016.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network, 2015.

Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2010.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. SqueezeLLM: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

Kurtic, E., Kuznedelev, D., Frantar, E., Goin, M., and Alistarh, D. Sparse fine-tuning for inference acceleration of large language models, 2023.

Li, Z., Ni, B., Zhang, W., Yang, X., and Gao, W. Performance guaranteed network acceleration via high-order residual quantization, 2017.

Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.

Martinez, J., Clement, J., Hoos, H. H., and Little, J. J. Revisiting additive quantization. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pp. 137–153. Springer, 2016.

Martinez, J., Zakhmi, S., Hoos, H. H., and Little, J. J. LSQ++: Lower running time and higher recall in multi-codebook quantization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 491–506, 2018.

McCarter, C. and Dronen, N. Look-ups are not (yet) all you need for deep learning inference. ArXiv, abs/2207.05808, 2022. URL https://api.semanticscholar.org/CorpusID:250491319.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Nagel, M., Amjad, R. A., Van Baalen, M., Louizos, C., and Blankevoort, T. Up or down? Adaptive rounding for post-training quantization. In International Conference on Machine Learning (ICML), 2020.

Norouzi, M. and Fleet, D. J. Cartesian k-means. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3017–3024, 2013.
Ozan, E. C., Kiranyaz, S., and Gabbouj, M. Competitive quantization for approximate nearest neighbor search. IEEE Transactions on Knowledge and Data Engineering, 28(11):2884–2894, 2016. doi: 10.1109/TKDE.2016.2597834.

Park, G., Park, B., Kwon, S. J., Kim, B., Lee, Y., and Lee, D. nuQmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557, 2022.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Conference on Neural Information Processing Systems (NeurIPS), 2019.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: an adversarial Winograd schema challenge at scale. Commun. ACM, 64(9):99–106, 2021. doi: 10.1145/3474381. URL https://doi.org/10.1145/3474381.

Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

Sun, S., Cheng, Y., Gan, Z., and Liu, J. Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355, 2019.

Tata, S. and Patel, J. M. PiQA: An algebra for querying protein data sets. In International Conference on Scientific and Statistical Database Management, 2003.

TII UAE. The Falcon family of large language models. https://huggingface.co/tiiuae/falcon-40b, May 2023.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and Sa, C. D. QuIP#: QuIP with lattice codebooks, 2023.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.

Xiao, G., Lin, J., Seznec, M., Demouth, J., and Han, S. SmoothQuant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2022.

Yao, Z., Aminabadi, R. Y., Zhang, M., Wu, X., Li, C., and He, Y. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861, 2022.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pp. 4791–4800. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1472. URL https://doi.org/10.18653/v1/p19-1472.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Zhang, T., Du, C., and Wang, J. Composite quantization for approximate nearest neighbor search. In International Conference on Machine Learning, pp. 838–846. PMLR, 2014.

Zhou, S.-C., Wang, Y.-Z., Wen, H., He, Q.-Y., and Zou, Y.-H. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology, 32(4):667–682, Jul 2017. ISSN 1860-4749. doi: 10.1007/s11390-017-1750-y. URL https://doi.org/10.1007/s11390-017-1750-y.
• L2 loss: ‖x − y‖²₂
• Normalized L2 loss: ‖x − y‖²₂ / ‖y‖²₂
• Cosine distance: ½ (1 − ⟨x, y⟩ / (‖x‖ ‖y‖))
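For clarity, the three candidate distillation losses from the list above can be written out in PyTorch as follows (ours, not from the paper); x denotes the student activation and y the teacher activation.

```python
# The three knowledge-distillation losses compared in Table 5.
import torch

def l2_loss(x, y):
    return (x - y).pow(2).sum()

def normalized_l2_loss(x, y):
    return (x - y).pow(2).sum() / y.pow(2).sum()

def cosine_distance(x, y):
    x, y = x.flatten(), y.flatten()
    return 0.5 * (1 - torch.dot(x, y) / (x.norm() * y.norm()))
```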
We tune all models for a single epoch on the same calibration set used for additive quantization and block finetuning. We
observe that longer training leads to overfitting. Learning rate of η = 1e − 5 and batch size of 32 with Adam optimizer are
used in full model finetuning experiments. We find that simple L2 loss outperforms the alternatives (Table 5), and that the
number of intermediate activations used in the loss function barely affects the final performance (Table 6).
Table 5: Ablation on the choice of loss for knowledge distillation.

Loss             Wiki2↓  C4↓
L2               6.57    8.48
Normalized L2    6.63    8.57
Cosine distance  6.65    8.59

Table 6: Ablation on the choice of blocks subset for knowledge distillation. Empty set means no knowledge distillation applied.

Blocks                           Wiki2↓  C4↓
{}                               6.64    8.56
{31}                             6.52    8.51
{15, 31}                         6.57    8.48
{7, 15, 23, 31}                  6.55    8.49
{3, 7, 11, 15, 19, 23, 27, 31}   6.57    8.54
Overall, the improvement due to knowledge distillation is not very significant, while it increases the complexity of the quantization pipeline; therefore, we decided not to include KD in our main setup.
B. Code reproducibility
We share the code for our method in the GitHub repository https://github.com/vahe1994/AQLM. The hyperpa-
rameters for our experimental setup are discussed in Appendix C.
C. Experimental Configurations
Hardware. In all of our experiments, we used either Nvidia A100 or H100 GPUs; the number of GPUs varied from 1 to 8. We used activation offloading to lower peak memory usage. To evaluate inference speed we used a consumer-grade Nvidia RTX 3090 GPU, and for the CPU setup we used an Intel Core i9-13900K.
Calibration set. All methods were calibrated on a slice of the RedPajama-v1 dataset (Computer, 2023) for both the Llama and Mistral/Mixtral family models. We used the same context length the models were trained on: 4096 for Llama-2 and 8192 for Mistral/Mixtral.

For the Llama-2 experiments, we used 8M tokens as the calibration set for SpQR, GPTQ, and AQLM. QuIP, however, was calibrated on 4M tokens due to out-of-memory errors when trying to use more samples. Since the improvement from additional calibration data is fairly small beyond 2M tokens, we chose to report these numbers as is. For QuIP#, we used the quantized Llama-2 and Mistral models provided by the authors in their GitHub repository. To the best of our knowledge, they used 6k samples for calibration with a context length of 4096/8192.

For Mixtral, we calibrated both our method and QuIP# on 8M tokens with a context length of 8192.
Hyperparameters.

For GPTQ, for both 3 and 4 bits, we used the standard set of parameters without grouping and with the act_order permutation order. The SpQR method was evaluated with base 2- and 3-bit widths, a group size of 16, and 3 bits for zeros and scales; the outlier rate was chosen such that the average bit-width is close to 3 and 4 bits, respectively.

QuIP was adapted to work on the Llama family and was calibrated with 1024 samples and a 4096 context length.

QuIP#: for Llama-2 and Mistral models we used the officially published quantized models. For Mixtral, we adapted the code to work with the model's architecture and quantized it with the recommended set of parameters. For both AQLM and QuIP#, we do not quantize the gate linear layer in Mixtral, because it contains a relatively small number of parameters while having a severe impact on performance.

AQLM: to get 2 bits, we used one codebook of size 2¹⁵ or 2¹⁶ with groups of 8; for 3 bits, we used two codebooks of size 2¹² with groups of 8; finally, for 4 bits, we used two codebooks of size 2¹⁵ or 2¹⁶ with groups of 8.
Both for the fine-tuning phase (Section 3.4) and the codebook update phase (Section 3.3), we used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10⁻⁴, β1 = 0.90 and β2 = 0.95. We used early stopping for both the fine-tuning phase and the codebook optimization phase, stopping when the least-squares error does not decrease by more than a threshold; in our experiments the threshold varies between 10⁻² and 10⁻³.
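The early-stopping rule described above can be sketched schematically as follows (ours, not from the paper): run Adam on the continuous parameters and stop once the loss no longer improves by more than a threshold. The names `compute_loss` and `params` are placeholders for the actual codebook-update or block fine-tuning objectives.

```python
# Schematic Adam-with-early-stopping loop; the exact stopping criterion (absolute vs.
# relative improvement) is an assumption here, controlled by the threshold.
import torch

def train_until_converged(params, compute_loss, threshold=1e-3, max_steps=10_000):
    opt = torch.optim.Adam(params, lr=1e-4, betas=(0.90, 0.95))
    best = float("inf")
    for _ in range(max_steps):
        opt.zero_grad()
        loss = compute_loss()
        loss.backward()
        opt.step()
        if best - loss.item() < threshold:   # stop when improvement falls below the threshold
            break
        best = loss.item()
    return best
```

In the AQLM setting, `compute_loss` would evaluate the layer-wise MSE of Eqn. (3) during codebook updates, or the block output MSE during block fine-tuning.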
D. Ablation analysis
The AQLM algorithm makes several design choices that need to be validated separately: initialization, alternating optimiza-
tion, the fine-tuning protocol, and the choice of hyperparameters. Here, we study how each of these components affects the results.
Initialization. As discussed in Section 3, we initialize AQLM with residual K-means to obtain a good initial guess for both
codes and codebooks. That is, we run K-means for the weight matrix, then subtract the nearest cluster from each weight,
and run K-means again M times. A simple baseline would be to initialize all codes uniformly at random. We compare
the two initialization strategies for the problem of quantizing a single linear layer within L LAMA 2 70B model to 3 bits
per parameter. We quantize groups of 8 consecutive weights using 2 codebooks, 12 bit each. Each codebook contains 212
learnable values. As we can see in Figure 4, AQLM with K-means initialization needs significantly fewer training iterations
to achieve the desired loss. The difference is so drastic that we expect that running AQLM with a random initialization
would require extremely high runtimes to accurately quantize the largest models.
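The residual K-means initialization described above can be sketched as follows (ours, not from the paper): cluster the weight groups, subtract the nearest centroid, and repeat M times on the residuals. The function name and toy K-means loop are illustrative; it assumes the number of groups is at least K.

```python
# Minimal residual K-means initialization for M codebooks over weight groups of size g.
import torch

def residual_kmeans_init(weight_groups, M, K, iters=10):
    """weight_groups: (num_groups, g) tensor; returns (M, K, g) codebooks and (num_groups, M) codes."""
    residual = weight_groups.clone()
    codebooks, codes = [], []
    for _ in range(M):
        # plain Lloyd's K-means on the current residuals
        centroids = residual[torch.randperm(residual.shape[0])[:K]].clone()
        for _ in range(iters):
            assign = torch.cdist(residual, centroids).argmin(dim=1)
            for k in range(K):
                mask = assign == k
                if mask.any():
                    centroids[k] = residual[mask].mean(dim=0)
        codebooks.append(centroids)
        codes.append(assign)
        residual = residual - centroids[assign]   # next codebook quantizes the residuals
    return torch.stack(codebooks), torch.stack(codes, dim=1)
```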
Fine-tuning. Next, we validate the fine-tuning procedure. We compare the full block fine-tuning (default) against three
[Figure 4: quantization MSE (log scale) over training iterations for Random vs. K-Means initialization.]
Table 7: Ablation analysis of AQLM with different fine-tuning restrictions on Llama-2 7B model at 2.02 bit width.
Number of samples. We verify our choice of calibration hyperparameters. Traditionally, most PTQ algorithms use several
hundred calibration sequences (e.g. Frantar et al. (2022a) has 128). In our experiments, we evaluate both AQLM and
baselines with additional calibration data. Our original motivation for that was to avoid potential overfitting when fine-tuning
entire transformer blocks. To test this assumption, we run our algorithm with different calibration set sizes, varying from
128 to 4096 sequences. For each size, we report the average perplexity on WikiText2 over 3 runs, along with standard
deviations. The results in Table 8 demonstrate that increasing the number of samples leads to gradual reduction in perplexity
with seemingly diminishing returns. Since the perplexity is still monotonically improving from 128 to 4096 samples, it is
possible that larger sample sizes would yield further improvements.
Number of codebooks vs. groups. Finally, we conducted an additional set of experiments on the Llama-2 7B model to see how perplexity depends on a simultaneous change of both the number of codebooks and the number of groups, keeping the compression rate fixed at 2 bits (Table 9). We can see that the simultaneous increase in both the number of codebooks and groups led to a decrease in average perplexity, which stopped once the number of groups reached 64.
Table 10: Evaluation of quantized L LAMA 2 models for 4+ bits per parameter. The table reports perplexity on Wiki-
Text2 (Merity et al., 2016) and C4 (Raffel et al., 2020), as well as accuracy for zero-shot tasks. The Average accuracy
column is the mean of 5 zero-shot task accuracies. Primary metrics are Wiki2 (PPL), C4 (PPL) and Average accuracy.
Size Method Avg bits Wiki2↓ C4↓ WinoGrande↑ PiQA↑ HellaSwag↑ ArcE↑ ArcC↑ Average accuracy↑
– 16 5.12 6.63 67.25 78.45 56.69 69.32 40.02 62.35
AQLM 4.04 5.21 6.75 67.32 78.24 55.99 70.16 41.04 62.55
GPTQ 4.00 5.49 7.20 68.19 76.61 55.44 66.20 36.77 60.64
7B
SpQR 3.98 5.28 6.87 66.93 78.35 56.10 69.11 39.68 62.17
QuIP# 4.02 5.29 6.86 66.85 77.91 55.78 68.06 39.68 61.66
AQLM 5.02 5.16 6.68 67.40 78.29 56.53 68.94 39.93 62.22
– 16 4.57 6.05 69.61 78.73 59.72 73.27 45.56 65.38
AQLM 3.94 4.65 6.14 69.85 78.35 59.27 73.32 44.80 65.12
GPTQ 4 4.78 6.34 70.01 77.75 58.67 70.45 42.49 63.87
13B
SpQR 3.98 4.69 6.20 69.69 78.45 59.25 71.21 44.52 64.42
QuIP 4.00 4.76 6.29 69.69 79.00 58.91 73.27 44.88 65.15
QuIP# 4.01 4.68 6.20 69.38 77.91 58.86 73.74 44.63 64.90
– 16 3.12 4.97 76.95 81.07 63.99 77.74 51.11 70.17
AQLM 4.14 3.19 5.03 76.48 81.50 63.69 77.31 50.68 69.93
GPTQ 4.00 3.35 5.15 75.61 81.23 63.47 76.81 49.15 69.25
70B
SpQR 3.97 3.25 5.07 76.01 81.28 63.71 77.36 49.15 69.50
QuIP 4.00 3.58 5.38 76.01 80.25 61.97 74.28 47.01 67.90
QuIP# 4.01 3.22 5.05 76.80 81.45 63.51 78.37 50.85 70.20
AQLM 3.82 3.21 5.03 76.32 80.90 63.69 77.61 50.34 69.77
Table 8: Wikitext2 PPL as a function of calibration set size for Llama 2 (7B) quantized to 2.3 bits with AQLM, averaged over 3 runs. SD stands for adjusted standard deviation.

Table 9: Wikitext2 PPL as a function of the number of groups and number of codebooks for Llama 2 (7B) quantized with approximately 2-bit quantization.
E. Additional experiments
In this section, we report additional experimental results for Mixtral (Jiang et al., 2024), Mistral 7B (Jiang et al., 2023) and Llama-2 models.
E.1. Mixtral
We report the results for the Mixtral (Jiang et al., 2024) MoE-type model for 3 and 4 bits in Table 11. In the 4-bit case, the performance of QuIP# and AQLM is very similar across all metrics and close to the uncompressed FP16 model.
Table 11: Evaluation of quantized Mixtral (Jiang et al., 2024) models for 3 and 4 bits per parameter. The table reports
perplexity on WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020), as well as accuracy for zero-shot tasks. The
Average accuracy column is the mean of 5 zero-shot task accuracies. Primary metrics are Wiki2 (PPL), C4 (PPL) and
Average accuracy.
Size Method Avg bits Wiki2↓ C4↓ WinoGrande↑ PiQA↑ HellaSwag↑ ArcE↑ ArcC↑ Average accuracy↑
– 16.00 3.46 5.02 75.45 82.37 64.65 83.38 55.80 72.33
3-bit
AQLM 3.02 3.8 5.18 74.74 81.77 63.22 82.28 52.30 70.86
– 16.00 3.46 5.02 75.45 82.37 64.65 83.38 55.80 72.33
4-bit AQLM 3.915 3.57 5.07 74.82 81.99 64.23 83.12 54.61 71.75
QuIP# 4.000 3.60 5.08 76.56 81.99 63.92 82.62 54.78 71.97
Table 13: Evaluation of quantized Mistral7B (Jiang et al., 2023) models for 2, 3 and 4 bits per parameter: perplexity on
WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020), as well as accuracy for zero-shot tasks. The Average accuracy
column is the mean of 5 zero-shot task accuracies. Primary metrics are Wiki2 (PPL), C4 (PPL) and Average accuracy.
Size Method Avg bits Wiki2↓ C4↓ WinoGrande↑ PiQA↑ HellaSwag↑ ArcE↑ ArcC↑ Average accuracy↑
– 16.00 4.77 5.71 73.64 80.47 61.15 78.87 49.23 68.67
2-bit AQLM 2.01 6.32 6.93 68.75 76.01 52.13 73.65 40.44 62.17
QuIP# 2.01 6.02 6.84 69.30 76.71 52.95 72.14 39.76 62.20
– 16.00 4.77 5.71 73.64 80.47 61.15 78.87 49.23 68.67
3-bit
AQLM 3.04 5.07 5.97 72.69 80.14 59.31 77.61 46.67 67.28
– 16.00 4.77 5.71 73.64 80.47 61.15 78.87 49.23 68.67
AQLM 4.02 4.89 5.81 73.80 79.71 60.27 77.86 48.21 67.97
4-bit
QuIP# 4.01 4.85 5.79 73.95 80.41 60.62 78.96 49.40 68.67
E.2. Llama-2
We show results for 4-bit quantization of the LLAMA 2 models in Table 10. We can see that AQLM outperforms the other methods in terms of perplexity and achieves the best or close-to-best results overall. We also report perplexity results for our quantized 2x8-codebook models in Table 12.
Table 12: Wikitext2 and C4 PPL for quantized Llama 2 models for 2x8 codebooks models.
Name Wiki2↓ C4↓
7B 7.98 10.38
70B 4.83 6.66
E.3. Mistral
Finally, we evaluate AQLM and QuIP# quantization of the Mistral 7B (Jiang et al., 2023) model for 2, 3 and 4 bits in Table 13. At 2 bits, QuIP# slightly outperforms AQLM on most benchmarks, while in the 4-bit setup the results are very close across the board.
F. Pareto optimality
We visualize the WikiText2 perplexity of Llama-2 7B, 13B and 70B models quantized with AQLM and QuIP#, plotted against the quantized weight size in bytes, in Figure 5. Our method outperforms QuIP# in terms of WikiText2 perplexity across all model sizes.

Additionally, in Figure 6, we show WikiText2 perplexity for AQLM against the size of the quantized parameters. We can notice that, starting around 3.7 GiB of quantized weights, which corresponds to 2.5-bit compression of the Llama-2 13B model, it is more advantageous to compress the 13B model rather than the 7B model at the same model size in bytes.
Figure 5: Comparison of AQLM relative to QuIP# on LLAMA 2 7B, 13B, and 70B models (WikiText2 perplexity vs. quantized model size).

Figure 6: Model optimality for AQLM on LLAMA 2 7, 13, and 70B models (WikiText2 perplexity vs. quantized model size).
The memory required is (assuming that codebooks and scales are stored in half precision):

• codebooks: g · M · 2^B · 16 bits
• codes: d_out · (d_in/g) · M · B bits
• scales: d_out · 16 bits

Therefore, the average number of bits per parameter can be computed as follows:

  b̄ = (size in bits) / (number of parameters) = (16 · g · M · 2^B + d_out · (d_in/g) · M · B + 16 · d_out) / (d_out · d_in)   (10)

For example, for the mlp.gate_proj layer of the LLAMA 2 70B model with d_in = 8192, d_out = 28672, quantization with group size 8 and two 8-bit codebooks, the formula above yields 2.002 bits per parameter. Typically, the storage cost is dominated by the codes, whereas the codebooks and scales induce a small memory overhead.
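The following small helper (ours, not from the paper) evaluates Eqn. (10) and reproduces the 2.002 bits/parameter example for the 70B gate_proj layer quoted above.

```python
# Average bits per parameter for an AQLM-compressed linear layer, per Eqn. (10).
def avg_bits_per_param(d_in, d_out, g, M, B):
    codebooks = 16 * g * M * 2 ** B          # FP16 codebook entries
    codes = d_out * (d_in // g) * M * B      # M B-bit codes per group of g weights
    scales = 16 * d_out                      # one FP16 scale per output unit
    return (codebooks + codes + scales) / (d_out * d_in)

print(round(avg_bits_per_param(d_in=8192, d_out=28672, g=8, M=2, B=8), 3))  # 2.002
```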
[Figure: Count vs. Codebook id (left) and PCA dim 2 vs. PCA dim 1 (right).]