Extreme Compression of Large Language Models via Additive Quantization

Vage Egiazarian * 1 2 Andrei Panferov * 1 2 Denis Kuznedelev 2 3 Elias Frantar 4 Artem Babenko 2 Dan Alistarh 4 5

arXiv:2401.06118v2 [cs.LG] 6 Feb 2024

Abstract

The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression—defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter—from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state-of-the-art in LLM compression, via two innovations: 1) learned additive quantization of weight matrices in input-adaptive fashion, and 2) joint optimization of codebook parameters across entire layer blocks. Broadly, AQLM is the first scheme that is Pareto optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter, and significantly improves upon all known schemes in the extreme compression (2-bit) regime. In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed, while executing in a much smaller memory footprint.

Figure 1: Comparison of AQLM (2-bit) relative to the state-of-the-art QuIP# (2-bit) and the original 16-bit weights on LLAMA 2 7, 13, and 70B models. (Plot: perplexity on WikiText2 vs. number of parameters, ×10^9.)

*Equal contribution. 1 HSE University, 2 Yandex Research, 3 Skoltech, 4 IST Austria, 5 NeuralMagic. Correspondence to: <[email protected]>.

Preliminary work. To be extended with additional experiments.

1. Introduction

The rapid advancement of generative large language models (LLMs) has led to massive industrial and popular interest, driven in part by the availability of accurate open LLMs, such as Llama 1 and 2 (Touvron et al., 2023), Falcon (TII UAE, 2023), BLOOM (Scao et al., 2022), OPT (Zhang et al., 2022), or NeoX/Pythia (Biderman et al., 2023). A key advantage of open models is that they can be inferenced or fine-tuned locally by end-users, assuming that their computational and memory costs can be reduced to be manageable on commodity hardware. This has led to several methods for inference and fine-tuning on compressed LLMs (Dettmers et al., 2022; Frantar et al., 2022a; Dettmers & Zettlemoyer, 2022; Lin et al., 2023; Dettmers et al., 2023a). Currently, the primary approach for accurate post-training compression of LLMs is quantization, which reduces the bit-width at which model weights (and possibly activations) are stored, leading to improvements in model footprint and memory transfer.

By and large, LLM weights are compressed via "direct" quantization, in the sense that a suitable quantization grid and normalization are first chosen for each matrix sub-component, and then weights are each mapped onto the grid either by direct rounding, e.g. (Dettmers & Zettlemoyer, 2022), or via more complex allocations, e.g. (Frantar et al., 2022a). Quantization induces a natural compression-vs-accuracy trade-off, usually measured in terms of model size vs. model perplexity (PPL). Existing approaches can achieve arguably low accuracy loss at 3-4 bits per element (Dettmers et al., 2023b; Chee et al., 2023; Kim et al., 2023), and can even stably compress models to 2 or even less bits per element, in particular, for extremely large models (Frantar & Alistarh, 2023). Yet, in most cases, low bit counts come at the cost of significant drops in accuracy, higher implementation complexity and runtime overheads. Specifically, from the practical perspective, "extreme" quantization in the 2-bit range using current techniques is inferior to simply using a smaller base model and quantizing it to higher bitwidths, such as 3-4 bits per parameter, as the latter yields higher accuracy given the same model size in bytes (Dettmers & Zettlemoyer, 2022; Chee et al., 2023).
Contribution. In this work, we improve the state-of-the-art in LLM compression by showing for the first time that Multi-Codebook Quantization (MCQ) techniques can be extended to LLM weight compression. Broadly, MCQ is a family of information retrieval methods (Chen et al., 2010; Jegou et al., 2010; Ge et al., 2013; Zhang et al., 2014; Babenko & Lempitsky, 2014; Martinez et al., 2016; 2018), consisting of specialized quantization algorithms to compress databases of vectors, allowing for efficient search. Unlike direct quantization, MCQ compresses multiple values jointly, by leveraging the mutual information of quantized values.

More precisely, we extend Additive Quantization (AQ) (Babenko & Lempitsky, 2014; Martinez et al., 2016), a popular MCQ algorithm, to the task of compressing LLM weights such that the output of each layer and Transformer block is approximately preserved. Our extension reformulates the classic AQ optimization problem to reduce the error in LLM layer outputs under the input token distribution, as well as to jointly optimize codes over layer blocks, rather than only preserving the weights themselves as in standard AQ. We refer to the resulting procedure as Additive Quantization of Language Models (AQLM). Unlike some extreme LLM quantization approaches that require hybrid sparse-quantized formats which separate outlier quantization (Kim et al., 2023; Dettmers et al., 2023b), AQLM quantizes models in a simple homogeneous format, which is easy to support in practice. Our main contributions are as follows:

1. We propose the AQLM algorithm, which extends AQ to post-training compression of LLM weights, via two innovations: (1) adapting the MAP-MRF optimization problem behind AQ to be instance-aware, taking layer calibration input & output activations into account; (2) complementing the layer-wise optimization with an efficient intra-layer tuning technique, which optimizes quantization parameters jointly over several layers, using only the calibration data.

2. We evaluate the effectiveness of this algorithm on the task of compressing accurate open LLMs from the LLAMA 2 (Touvron et al., 2023) family with compression rates of 2-4 bits per parameter. We find that AQLM outperforms the previous state-of-the-art across the standard 2-4 bit compression range, with the most significant improvements for extreme 2-bit quantization (see Figure 1). We provide detailed ablations for the impact of various algorithm parameters, such as code width and number of codebooks, and extend our analysis to the recent Mixtral model (Jiang et al., 2024).

3. We show that AQLM is practical, by providing efficient GPU and CPU kernel implementations for specific encodings, as well as end-to-end generation¹. Results show that our approach can match or even outperform the floating point baseline in terms of speed, while reducing the memory footprint by up to 8x. Specifically, AQLM can be executed with layer-wise speedups of ∼30% for GPUs, and of up to 4x for CPU inference.

¹ https://github.com/vahe1994/AQLM

2. Background & Related Work

2.1. LLM Quantization

Early efforts towards post-training quantization (PTQ) methods (Nagel et al., 2020; Gholami et al., 2021) that scale to LLMs, such as ZeroQuant (Yao et al., 2022), LLM.int8() (Dettmers et al., 2022), and nuQmm (Park et al., 2022), employed direct round-to-nearest (RTN) projections, and adjusted quantization granularity to balance memory efficiency and accuracy. GPTQ (Frantar et al., 2022a) proposed a more accurate data-aware approach via an approximate large-scale solver for minimizing layer-wise ℓ2 errors.

Dettmers & Zettlemoyer (2022) examined the accuracy-compression trade-offs of these early methods, suggesting that 4-bit quantization may be optimal for RTN quantization, and observing that data-aware methods like GPTQ allow for higher compression, i.e. strictly below 4 bits/weight, while maintaining Pareto optimality. Our work brings this Pareto frontier below 3 bits/weight, for the first time. Parallel work quantizing both weights and activations to 8 bits, by Dettmers et al. (2022), Xiao et al. (2022), and Yao et al. (2022), noted that the "outlier features" in large LLMs cause substantial errors, prompting various mitigation strategies.

Recently, several improved techniques have focused on the difficulty of quantizing weight outliers, which have a high impact on the output error. SpQR (Dettmers et al., 2023b) addresses this by saving outliers as a highly-sparse higher-precision matrix. AWQ (Lin et al., 2023) reduces the error of quantizing channels with the highest activation magnitudes by employing per-channel scaling to reduce the error on important weights. SqueezeLLM (Kim et al., 2023) uses the diagonal Fisher as a proxy for the Hessian and implements non-uniform quantization through K-means clustering.

The state-of-the-art method in terms of accuracy-to-size trade-off is QuIP (Chee et al., 2023). Concurrent to our work, an improved variant called QuIP# (Tseng et al., 2023) was introduced. Roughly, these methods work by first "smoothening" weights by multiplying with a rotation matrix, and then mapping them onto a lattice. QuIP was the first method to obtain stable results (i.e., single-digit PPL increases) in the 2-bit per parameter compression range. At a high level, QuIP and QuIP# aim to minimize the "worst-case" error for each layer, given initial weights and calibration data. For instance, in QuIP#, the distribution of the rotated weights approximates a Gaussian, while the encoding lattice (E8P) is chosen to minimize "rounding" error.
By contrast, our approach uses a different weight encoding (codebooks are additive), and learned codebooks instead of a fixed codebook. Thus, our insight is that we should be able to obtain higher accuracy by direct optimization of the codebooks over the calibration set, removing the rotation. Further, we show that codebooks for different layers can co-train via joint fine-tuning over the calibration data.

2.2. Quantization for Nearest Neighbor Search

Our work builds on approximate nearest neighbor search (ANN) algorithms. Unlike PTQ, ANN quantization aims to compress a database of vectors to allow a user to efficiently compute similarities and find nearest neighbors relative to a set of query points. For high compression, modern ANN search algorithms employ vector quantization (VQ), which quantizes multiple vector dimensions jointly (Burton et al., 1983; Gray, 1984). It achieves this by learning "codebooks": i.e. a set of learnable candidate vectors that can be used to encode the data. To encode a given database vector, VQ splits it into sub-groups of entries, then encodes every group by choosing a vector from the learned codebook. The algorithm efficiently computes distances or dot-products for similarity search by leveraging the linearity of dot products.

Quantization methods for ANN search generalize vector quantization and are referred to as multi-codebook quantization (MCQ). MCQ methods typically do not involve information loss on the query side, which makes them the leading approach for memory-efficient ANN (Ozan et al., 2016; Martinez et al., 2018). We briefly review MCQ below.

Product quantization (PQ) (Jegou et al., 2010) is an early version of MCQ, which encodes each vector x ∈ R^D as a concatenation of M codewords from M D/M-dimensional codebooks C_1, ..., C_M, each containing K codewords. PQ decomposes a vector into M separate subvectors and applies vector quantization (VQ) to each subvector, while using a separate codebook. Thus, each vector x is encoded by a tuple of codeword indices [i_1, ..., i_M] and approximated by x ≈ [c_{1 i_1}, ..., c_{M i_M}]. Fast Euclidean distance computation becomes possible using lookup tables:

||q − x||² ≈ ||q − [c_{1 i_1}, ..., c_{M i_M}]||² = Σ_{m=1}^{M} ||q_m − c_{m i_m}||²,

where q_m is the m-th subvector of a query q. This sum can be calculated using M additions and lookups if the distances from query subvectors to codewords are precomputed.
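The lookup-table computation above can be made concrete with a short PyTorch sketch of PQ asymmetric distance computation; the shapes and names below are our own illustration rather than any particular library's API.

```python
import torch

def pq_distances(query, codebooks, codes):
    """Approximate squared L2 distances from `query` to PQ-encoded database vectors.

    query:     (D,) tensor, split into M subvectors of size D // M
    codebooks: (M, K, D // M) tensor of learned codewords
    codes:     (N, M) integer tensor of codeword indices per database vector
    """
    M, K, d = codebooks.shape
    q_sub = query.view(M, 1, d)                   # (M, 1, d)
    # Precompute the M x K table of subvector-to-codeword distances.
    lut = ((q_sub - codebooks) ** 2).sum(-1)      # (M, K)
    # Each approximate distance is a sum of M table lookups.
    return lut.gather(1, codes.t()).sum(0)        # (N,)

# Tiny usage example with random data.
D, M, K, N = 32, 4, 256, 1000
codebooks = torch.randn(M, K, D // M)
codes = torch.randint(K, (N, M))
query = torch.randn(D)
dists = pq_distances(query, codebooks, codes)
```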
Since product-based approximations work better if the D/M-dimensional components have independent distributions, subsequent work has looked into finding better transformations (Ge et al., 2013; Norouzi & Fleet, 2013). As for other similarity functions, Guo et al. (2016) propose a quantization procedure for maximum inner product search (MIPS). They minimize the quantization error in the inner products between database and query vectors by solving a constrained optimization problem. Similarly to the formula above, this procedure allows for efficient inner product search by pre-computing dot products between the query q and all codes in the learned codebooks, then adding these partial dot products to recover the full similarity score.

Non-orthogonal quantizations. Follow-up work (Chen et al., 2010; Babenko & Lempitsky, 2014; Martinez et al., 2016; Zhang et al., 2014; Ozan et al., 2016; Martinez et al., 2018) generalized the idea of Product Quantization by approximating each vector by a sum of M codewords instead of a concatenation. The resulting procedure is still efficient while the approximation accuracy is increased.

For this, Residual Vector Quantization (Chen et al., 2010) quantizes the original vectors, and then iteratively quantizes the approximation residuals from the previous iteration. Additive Quantization (AQ) (Babenko & Lempitsky, 2014) is more general, as it does not impose constraints on the codewords from the different codebooks. Usually, AQ provides the smallest compression errors, but is more complex to train for large M. We discuss this in detail in Section 3.

Finally, several recent works (Martinez et al., 2016; 2018; Zhang et al., 2014) elaborate on the idea of Additive Quantization, proposing more effective procedures for codebook learning. Composite Quantization (CQ) (Zhang et al., 2014) learns codebooks with a fixed value of the scalar product between codewords from different codebooks. Currently, the state-of-the-art compression accuracy is achieved by the LSQ method (Martinez et al., 2018).

Vector quantization for model compression. There has been significant work on exploiting vector quantization in the context of machine learning. For instance, Zhou et al. (2017); Li et al. (2017); Chen et al. (2019) use multi-codebook quantization to compress word embeddings within deep learning models. Another line of work (Blalock & Guttag, 2021; McCarter & Dronen, 2022; Fernández-Marqués et al., 2023) explores vector quantization for linear models, or linear layers within deep models. Similarly to PQ above, these techniques pre-compute inner products between inputs and all codes, then compute the linear layer via look-up, which speeds up inference. However, these algorithms introduce significant prediction error that does not allow them to compress deep models. Thus, we believe we are the first to successfully adapt and scale MCQ to LLMs.
3. AQLM: Additive Quantization for LLMs

3.1. Overview

We start from the observation that additive quantization (AQ) solves a related problem to post-training quantization (PTQ) (Nagel et al., 2020; Frantar et al., 2022b): both settings assume the existence of a set of "input" vectors, i.e. input data for AQ, and the weight matrix rows for PTQ. The goal is to compress these inputs while preserving dot product similarity, against query vectors (for AQ), and against layer input embeddings (for PTQ). The difference between the two is that AQ assumes that the distribution of queries is unknown, whereas PTQ methods, e.g. (Frantar et al., 2022b), show that it is sufficient to optimize for sample input embeddings from a set of calibration data.

At a high level, we start by solving the following problem: for a linear layer with d_in input and d_out output features, given its weights W ∈ R^{d_out × d_in} and a set of calibration inputs X ∈ R^{d_in × n}, one seeks a configuration of quantized weights Ŵ that minimizes the squared error between the outputs of the original and compressed layer:

argmin_{Ŵ} ||WX − ŴX||²₂.   (1)

In the following, we will assume that Ŵ is quantized using AQ, and adopt standard notation (Martinez et al., 2016). AQ splits weight rows into groups of g consecutive elements, and represents each group of weights as a sum of M vectors chosen from multiple learned codebooks C_1, ..., C_M, each containing 2^B vectors (for B-bit codes). Each group is encoded by choosing a single code from each codebook and summing them up. We denote this choice as a one-hot vector b_m, which results in the following representation for a group: Σ_{m=1}^{M} C_m b_{ijm}. This is similar to PTQ algorithms (Frantar et al., 2022a), except for using much more complex coding per group. To represent the full weights, we simply concatenate:

Ŵ_i = Σ_{m=1}^{M} C_m b_{i,1,m} ⊕ ... ⊕ Σ_{m=1}^{M} C_m b_{i,d_in/g,m},   (2)

where ⊕ denotes concatenation and b_{ijm} ∈ R^{2^B} represents a one-hot code for the i-th output unit, j-th group of input dimensions and m-th codebook.

Figure 2: Groups of weights are represented by a sum of codes selected from codebooks by corresponding indices.

Our algorithm will learn the codebooks C_m ∈ R^{g × 2^B} and the discrete codes represented by the one-hot tensor b ∈ R^{d_out × d_in/g × M × 2^B}. The resulting scheme encodes each group of g weights using M·B bits and further requires g·2^B·16 bits for the FP16 codebooks. The error becomes:

argmin_{C,b} ||WX − Concat_{i,j}(Σ_{m=1}^{M} C_m b_{i,j,m}) X||²₂.   (3)

To learn this weight representation, we initialize the codebooks C and codes b by running residual K-means as in Chen et al. (2010). Then, we alternate between updating the codes b_{i,j,m} and the codebooks C_m until the loss function (3) stops improving up to a specified tolerance. Since codes are discrete and codebooks are continuous, and we are optimizing over multiple interacting layers, our approach has three phases, described in Algorithm 1 and detailed below.
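To make the additive representation in (2)-(3) concrete, here is a minimal PyTorch sketch of how an approximate weight matrix could be reconstructed from codes, codebooks, and per-output scales (the scales are introduced in Section 3.3); the tensor layout and helper name are illustrative assumptions, not the exact storage format of the AQLM implementation.

```python
import torch

def reconstruct_weight(codes, codebooks, scales):
    """Rebuild an approximate weight matrix from an additive-quantized format.

    codes:     (d_out, n_groups, M) integer tensor of codebook indices
    codebooks: (M, 2**B, g) tensor of learned codewords (g = group size)
    scales:    (d_out, 1) per-output-unit scales
    Returns a (d_out, n_groups * g) dequantized weight matrix.
    """
    d_out, n_groups, M = codes.shape
    _, _, g = codebooks.shape
    # Look up the selected codeword in every codebook: (d_out, n_groups, M, g).
    selected = codebooks[torch.arange(M), codes]
    # Sum the M additive codewords per group, then concatenate groups.
    groups = selected.sum(dim=2)                      # (d_out, n_groups, g)
    weight = groups.reshape(d_out, n_groups * g)
    return weight * scales                            # one scale per output row

# Example: 2 codebooks with 8-bit codes over groups of 8 weights.
d_out, d_in, g, M, B = 16, 64, 8, 2, 8
codebooks = torch.randn(M, 2**B, g)
codes = torch.randint(2**B, (d_out, d_in // g, M))
scales = torch.ones(d_out, 1)
W_hat = reconstruct_weight(codes, codebooks, scales)  # (16, 64)
```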
3.2. Phase 1: Beam search for codes

First, AQLM updates the codes b_{i,j,m} to minimize the MSE objective (3). Similarly to Babenko & Lempitsky (2014); Martinez et al. (2016; 2018), we reformulate the objective in terms of a fully-connected discrete Markov Random Field (MRF) to take advantage of MRF solvers.

To simplify the derivation, let us first consider the special case of a single output unit (d_out = 1) and a single quantization group (i.e. g = d_in), to get rid of the concatenation operator: ||WX − Σ_{m=1}^{M} C_m b_m X||²₂. We rewrite this objective by expanding the squared difference:

||WX − Σ_{m=1}^{M} C_m b_m X||²₂ = ||WX||²₂ − 2⟨WX, Σ_{m=1}^{M} C_m b_m X⟩_F + ||Σ_{m=1}^{M} C_m b_m X||²₂.   (4)

Above, ⟨·,·⟩_F denotes the Frobenius inner product of two matrices. Next, let us consider the three components of Eqn. (4) in isolation. First, note that ||WX||²₂ is constant in b and can be ignored. The third component can be expanded further into pairwise dot products:

||Σ_{m=1}^{M} C_m b_m X||²₂ = Σ_{i=1}^{M} Σ_{j=1}^{M} ⟨C_i b_i X, C_j b_j X⟩_F.   (5)

Note that both the second and third components rely on Frobenius products of C_m b_m X-like matrices. These matrices can be inconvenient in practice: since X ∈ R^{d_in × n}, the size of each matrix scales with the size n of the calibration dataset. To circumvent this, we rewrite the products as:

⟨C_i b_i X, C_j b_j X⟩_F = ⟨C_i b_i XXᵀ, C_j b_j⟩_F.   (6)

Thus one can pre-compute XXᵀ ∈ R^{d_in × d_in}. We will denote this type of product as ⟨A, B⟩_{XXᵀ} := ⟨A XXᵀ, B⟩_F in future derivations. Then, Eqn. (4) becomes:

||WX − Σ_{m=1}^{M} C_m b_m X||²₂ = ||WX||²₂ − 2 Σ_{m=1}^{M} ⟨W, C_m b_m⟩_{XXᵀ} + Σ_{i=1}^{M} Σ_{j=1}^{M} ⟨C_i b_i, C_j b_j⟩_{XXᵀ}.   (7)

Finally, we generalize this equation to multiple output units (d_out > 1) and quantization groups (g ≠ d_in). For d_out > 1, note that the original objective (3) is additive with respect to output units: thus, we can apply (7) independently to each output dimension and sum up the results. To support multiple input groups (g ≠ d_in), we can treat each group as a separate codebook where only the codes for the active group are nonzero. Thus, we need to repeat each codebook d_in/g times and pad it with zeros according to the active group.

It is now evident that minimizing (4) is equivalent to MAP inference in a Markov Random Field with ⟨W, C_m b_m⟩_{XXᵀ} as unary potentials and ⟨C_i b_i, C_j b_j⟩_{XXᵀ} as pairwise potentials. While finding the exact optimum is infeasible, prior work has shown that this type of MRF can be solved approximately via beam search or ICM (Besag, 1986).

To solve this problem, we chose to adapt a beam search algorithm from Babenko & Lempitsky (2014). This algorithm maintains a beam of k (beam size) best configurations for the codes, starting from the previous solution. On each step, the algorithm attempts to replace one code by trying all 2^B·k alternatives and selecting the k best based on the MSE (7). Since the loss function is additive, changing one code only affects a small subset of loss components. Thus, we can compute the loss function efficiently by starting with the previous loss value (before code replacement), then adding and subtracting the components that changed during this iteration. These few loss components can be computed efficiently by multiplying with XXᵀ ahead of beam search. The beam search runs over all d_out output units in parallel. This is possible because encoding one output unit does not affect the objective (7) of other units. Note that beam search is not necessarily the best solution to this problem. AQ variants for retrieval (Martinez et al., 2016; 2018) use randomized ICM to find solutions faster. In this study, we chose beam search because it was easier to implement in ML frameworks like PyTorch/JAX.
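For intuition, the following is a deliberately simplified, beam-size-1 (ICM-style) sketch of the code-update step for a single weight row in the single-group case; the actual AQLM implementation uses batched beam search over all output units, so treat this as an illustration of the objective (6)-(7), not of the real kernel.

```python
import torch

def greedy_code_update(w, XXT, codebooks, codes, passes=2):
    """ICM-style (beam size 1) code update for one weight row, single-group case.

    w:         (d_in,) one row of the weight matrix
    XXT:       (d_in, d_in) precomputed calibration statistic X @ X.T
    codebooks: (M, K, d_in) additive codebooks
    codes:     (M,) current code indices (long tensor), updated greedily
    """
    M = codebooks.shape[0]
    for _ in range(passes):
        for m in range(M):
            # Part of w explained by all codebooks except the one being updated.
            others = sum(codebooks[j, codes[j]] for j in range(M) if j != m)
            r = w - others
            # Candidate residuals for every codeword of codebook m: (K, d_in).
            diff = r.unsqueeze(0) - codebooks[m]
            # ||diff @ X||^2 for each candidate, evaluated through XX^T.
            errors = ((diff @ XXT) * diff).sum(dim=-1)
            codes[m] = torch.argmin(errors)
    return codes
```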
3.3. Phase 2: Codebook update

In the second phase, we find the optimal codebook vectors C_1, ..., C_M that minimize the same squared error as the beam search. If we treat the codes b as constants, minimizing (3) becomes a least squares problem for C_m. The original AQ algorithm solves this problem in closed form, relying on the fact that each vector dimension can be optimized independently. Our problem is complicated due to the presence of XXᵀ: the optimal value of one codebook coordinate depends on the values of all others. In principle, we could optimize C_m in closed form, but it would require inverting a large matrix, or using iterative least squares solvers (e.g. conjugate gradients) specialized to this problem.

For simplicity, our current implementation defaults to using Adam (Kingma & Ba, 2015) for approximately solving this minimization problem. In practice, this codebook tuning phase takes up a small fraction of the total compute time. We compute the objective as follows:

||WX − ŴX||²₂ = ||(W − Ŵ)X||²₂ = ⟨(W − Ŵ)XXᵀ, (W − Ŵ)⟩_F,   (8)

where Ŵ is the quantized weight matrix from (2), and the XXᵀ matrix is pre-computed. We optimize this objective by iterating (non-stochastic) full-batch gradient descent.

For each update phase, our implementation runs 100 Adam steps with learning rate 1e-4. However, we found that the final result is not sensitive to either of these parameters: training with a smaller number of steps or a lower learning rate achieves the same loss, but takes longer to converge. In future work, these hyperparameters could be eliminated by switching to a dedicated least squares solver for the codebooks. Similarly to other algorithms, we also learn per-unit scales s ∈ R^{d_out} that are initialized as s_i := ||W_i||₂ and updated alongside the codebooks via the same optimizer (line 19 in Algorithm 1).
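For illustration, the codebook-update phase can be sketched as a small Adam loop over objective (8), reusing the reconstruct_weight helper sketched earlier; the step count and learning rate mirror the values mentioned in the text, but the function itself is a simplified assumption rather than the reference implementation.

```python
import torch

def update_codebooks(W, XXT, codes, codebooks, scales, steps=100, lr=1e-4):
    """Phase 2 sketch: tune continuous codebooks/scales with codes held fixed."""
    codebooks = codebooks.clone().requires_grad_(True)
    scales = scales.clone().requires_grad_(True)
    opt = torch.optim.Adam([codebooks, scales], lr=lr)
    for _ in range(steps):
        W_hat = reconstruct_weight(codes, codebooks, scales)
        delta = W - W_hat
        # Objective (8): <(W - W_hat) XX^T, (W - W_hat)>_F, with XX^T precomputed.
        loss = torch.sum((delta @ XXT) * delta)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return codebooks.detach(), scales.detach()
```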
3.4. Phase 3: Fine-tuning for intra-layer cohesion

So far, our algorithm compresses each weight matrix independently of the rest of the model. However, in practice, quantization errors interact differently between matrices. This issue is especially relevant in the case of extreme (2-bit) compression, where quantization errors are larger.

Prior work addresses this issue via quantization-aware training (QAT), e.g. (Gholami et al., 2021). Instead of compressing the entire model in a single pass, they quantize model parameters gradually and train the remaining parameters to compensate for the quantization error. Unfortunately, running QAT in our setting is infeasible, since most modern LLMs are extremely expensive to train or even fine-tune. Thus, most PTQ algorithms for LLMs only adjust model parameters within the same linear layer (Frantar et al., 2022a; Lin et al., 2023; Dettmers et al., 2023b).

Here, we opt for a middle ground by performing optimization at the level of individual transformer blocks, i.e. groups of 4-8 linear layers² that constitute a single multi-head self-attention, followed by a single MLP layer. Having quantized all linear layers within a single transformer block, we fine-tune its remaining parameters to better approximate the original outputs of that transformer block by backpropagating through the weight representation (2).

Concretely, we use the PyTorch autograd engine to differentiate ||block(X_block) − Y_block||², where X_block are the input activations for that transformer block and Y_block are the output activations of block(X_block) recorded prior to quantization. We train the codebooks C_m, the scale vectors s, and all non-quantized parameters (RMSNorm scales and biases), while keeping the codes b_{i,j,m} frozen. Similarly to Section 3.3, we train these parameters using Adam to minimize the MSE against the original block outputs (prior to quantization). This phase uses the same calibration data as for the individual layer quantization. The full procedure is summarized in Algorithm 1, and the resulting compressed weight format is illustrated in Figure 3.

Figure 3: AQLM compressed weight format. Horizontal and vertical axes are input features and output units, respectively. Depth represents the codebook index. Reconstruction procedure, from left to right: i) compressed weight codes; ii) zoom in on one weight group, where each code is an index in its respective codebook; iii) select codes from each codebook; iv) add up codes as in (2); v) multiply by scales (one scale per output dimension).

Algorithm 1 AQLM: Additive Quantization for LLMs
Require: model, data
 1: X_block := model.input_embeddings(data)
 2: for i = 1, . . . , model.num_layers do
 3:   block := model.get_block(i)
 4:   Y_block := block(X_block)
 5:   for layer ∈ linear_layers(block) do
 6:     W := layer.weight
 7:     X := layer_inputs(layer, X_block)
 8:     C, b, s := initialize(W)   // k-means
 9:     while loss improves by at least τ do
10:       C, s := train_Cs_adam(XXᵀ, W, C, b, s)
11:       b := beam_search(XXᵀ, W, C, b, s)
12:     end while
13:     /* save for fine-tuning */
14:     layer.weight := AQLMFormat(C, b, s)
15:   end for
16:   θ := trainable_parameters(block)
17:   while loss improves by at least τ do
18:     L := ||block(X_block) − Y_block||²₂
19:     θ := adam(θ, ∂L/∂θ)
20:   end while
21:   X_block := block(X_block)
22: end for

While fine-tuning blocks is more expensive than fine-tuning individual linear layers, it is still possible to quantize billion-parameter models on a single GPU in reasonable time. Also, since the algorithm only modifies a few trainable parameters, it uses little VRAM for optimizer states. This fine-tuning converges after a few iterations, as it starts from a good initial guess. In practice, fine-tuning transformer layers takes a minority (10-30% or less) of the total calibration time.

² This number depends on factors including the use of gated GLU activations, grouped query attention, and QKV weight merging.
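The block-level fine-tuning loop (lines 16-20 of Algorithm 1) can be sketched in PyTorch roughly as follows; the parameter selection via requires_grad and the tolerance logic are illustrative assumptions, and the real implementation differs in details such as early stopping criteria and activation offloading.

```python
import torch

def finetune_block(block, X_block, Y_block, lr=1e-4, max_steps=1000, tol=1e-3):
    """Phase 3 sketch: adjust codebooks, scales and non-quantized parameters of one
    transformer block so its outputs match the pre-quantization outputs."""
    # Codes stay frozen; only continuous parameters (codebooks, scales, norms,
    # biases) are exposed through requires_grad.
    params = [p for p in block.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr, betas=(0.90, 0.95))
    prev_loss = float("inf")
    for _ in range(max_steps):
        loss = (block(X_block) - Y_block).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Stop once the objective no longer improves by at least `tol`.
        if prev_loss - loss.item() < tol:
            break
        prev_loss = loss.item()
    return block
```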
4. Experiments

We evaluate the AQLM algorithm in typical scenarios for post-training quantization of modern LLMs. Our evaluation is focused on the LLAMA 2 model family, since it is a popular backbone for fine-tuned models and general LLM applications, e.g. (Dettmers et al., 2023a), and we also present results on Mistral-family models (Jiang et al., 2024). In Section 4.1, we evaluate the full AQ procedure for various LLAMA 2 models and quantization bit-widths; Section 4.2 presents an ablation analysis for individual AQ components and implementation details.

4.1. Compression quality for modern LLMs

We report perplexity on the WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020) validation sets. We also measure zero-shot accuracy on WinoGrande (Sakaguchi et al., 2021), PiQA (Tata & Patel, 2003), HellaSwag (Zellers et al., 2019), ARC-easy and ARC-challenge (Clark et al., 2018) via the LM Eval Harness (Gao et al., 2021). We broadly follow the evaluation setup of GPTQ (Frantar et al., 2022a).

We consider three main targets in terms of compression ranges: 2-2.8 bits, 3-3.1 bits, and 4-4.1 bits per model parameter. In the results below, the average bits per parameter takes into account only quantized weights; we do not include parameters kept in floating precision, similarly to the related work. The details on the model size estimate are provided in Appendix G. We compare AQLM against GPTQ for 3 & 4 bits (Frantar et al., 2022a), SpQR for 3 & 4 bits (Dettmers et al., 2023b), QuIP in 2, 3 & 4 bits (Chee et al., 2023) and QuIP# for 2 & 4 bits (Tseng et al., 2023). While GPTQ and SpQR technically support 2-bit quantization, they perform poorly in the 2-3 bit range. For QuIP, we omit results for the 7B model, as we could not achieve competitive performance in this one scenario using the available implementations. (Currently, there is no official implementation of the original QuIP (non-#) for the LLAMA 2 models.) For QuIP#, we focus on 2 and 4 bits because the available implementation does not yet support 3-bit compression. We calibrate each algorithm using a subset of the RedPajama dataset (Computer, 2023), with a sequence length of 4096.

The exact bit-widths for each method are dictated by parameters such as the number of codebooks and the code width. We report results for the 2-2.8 and 3-3.1 bitwidth ranges in Tables 1 and 2, respectively. Additional results for 4-4.1 bits are deferred to Appendix E.2.
Table 1: Evaluation of quantized LLAMA 2 models for 2-2.8 bits per parameter, with an extra section for higher bitwidths. We report perplexity on WikiText2 (Merity et al., 2016) & C4 (Raffel et al., 2020) and accuracy for zero-shot tasks. The Average accuracy column is the mean of the 5 zero-shot task accuracies. Primary metrics are Wiki2 (PPL), C4 (PPL) and Average accuracy.

Size | Method | Avg bits | Wiki2↓ | C4↓ | WinoGrande↑ | PiQA↑ | HellaSwag↑ | ArcE↑ | ArcC↑ | Average accuracy↑
7B | – | 16 | 5.12 | 6.63 | 67.25 | 78.45 | 56.69 | 69.32 | 40.02 | 62.35
7B | AQLM | 2.02 | 6.64 | 8.56 | 64.17 | 73.56 | 49.49 | 61.87 | 33.28 | 56.47
7B | QuIP# | 2.02 | 8.22 | 11.01 | 62.43 | 71.38 | 42.94 | 55.56 | 28.84 | 52.23
7B | AQLM | 2.29 | 6.29 | 8.11 | 65.67 | 74.92 | 50.88 | 66.50 | 34.90 | 58.57
13B | – | 16 | 4.57 | 6.05 | 69.61 | 78.73 | 59.72 | 73.27 | 45.56 | 65.38
13B | AQLM | 1.97 | 5.65 | 7.51 | 65.43 | 76.22 | 53.74 | 69.78 | 37.80 | 60.59
13B | QuIP | 2.00 | 13.48 | 16.16 | 52.80 | 62.02 | 35.80 | 45.24 | 23.46 | 43.86
13B | QuIP# | 2.01 | 6.06 | 8.07 | 63.38 | 74.76 | 51.58 | 64.06 | 33.96 | 57.55
13B | AQLM | 2.18 | 5.41 | 7.20 | 68.43 | 76.22 | 54.68 | 69.15 | 39.42 | 61.58
13B | AQLM | 2.53 | 5.15 | 6.80 | 68.11 | 76.99 | 56.54 | 71.38 | 40.53 | 62.71
13B | AQLM | 2.76 | 4.94 | 6.54 | 68.98 | 77.58 | 57.71 | 72.90 | 43.60 | 64.15
70B | – | 16 | 3.12 | 4.97 | 76.95 | 81.07 | 63.99 | 77.74 | 51.11 | 70.17
70B | AQLM | 2.07 | 3.94 | 5.72 | 75.93 | 80.43 | 61.79 | 77.68 | 47.93 | 68.75
70B | QuIP | 2.01 | 5.90 | 8.17 | 67.48 | 74.76 | 50.45 | 62.16 | 33.96 | 57.76
70B | QuIP# | 2.01 | 4.16 | 6.01 | 74.11 | 79.76 | 60.01 | 76.85 | 47.61 | 67.67

Table 2: Evaluation of quantized LLAMA 2 models for 3-3.1 bits per parameter, with the same metrics as in Table 1.

Size | Method | Avg bits | Wiki2↓ | C4↓ | WinoGrande↑ | PiQA↑ | HellaSwag↑ | ArcE↑ | ArcC↑ | Average accuracy↑
7B | – | 16 | 5.12 | 6.63 | 67.25 | 78.45 | 56.69 | 69.32 | 40.02 | 62.35
7B | AQLM | 3.04 | 5.46 | 7.08 | 66.93 | 76.88 | 54.12 | 68.06 | 38.40 | 60.88
7B | GPTQ | 3.00 | 8.06 | 10.61 | 59.19 | 71.49 | 45.21 | 58.46 | 31.06 | 53.08
7B | SpQR | 2.98 | 6.20 | 8.20 | 63.54 | 74.81 | 51.85 | 67.42 | 37.71 | 59.07
13B | – | 16 | 4.57 | 6.05 | 69.61 | 78.73 | 59.72 | 73.27 | 45.56 | 65.38
13B | AQLM | 3.03 | 4.82 | 6.37 | 68.43 | 77.26 | 58.30 | 70.88 | 42.58 | 64.49
13B | GPTQ | 3.00 | 5.85 | 7.86 | 63.93 | 76.50 | 53.47 | 65.66 | 38.48 | 59.61
13B | SpQR | 2.98 | 5.28 | 7.06 | 67.48 | 77.20 | 56.34 | 69.78 | 39.16 | 61.99
13B | QuIP | 3.00 | 5.12 | 6.79 | 69.93 | 76.88 | 57.07 | 70.41 | 41.47 | 63.15
70B | – | 16 | 3.12 | 4.97 | 76.95 | 81.07 | 63.99 | 77.74 | 51.11 | 70.17
70B | AQLM | 3.01 | 3.36 | 5.17 | 77.19 | 81.28 | 63.23 | 77.61 | 50.00 | 69.86
70B | GPTQ | 3.00 | 4.40 | 6.26 | 71.82 | 78.40 | 60.00 | 72.73 | 44.11 | 65.41
70B | SpQR | 2.98 | 3.85 | 5.63 | 74.66 | 80.52 | 61.95 | 75.93 | 48.04 | 68.22
70B | QuIP | 3.01 | 3.87 | 5.67 | 74.59 | 79.98 | 60.73 | 73.19 | 46.33 | 66.96

The results show that AQLM outperforms the previous best PTQ algorithms across all settings, often by wide margins, especially at high compression. This holds both in terms of PPL across the standard validation sets (WikiText2 and C4) and accuracy across zero-shot tasks. Specifically, we observe the highest accuracy gains in the "extreme" 2-2.1 bits per parameter range, where the deviation from the uncompressed model becomes large for all methods.

Mixtral quantization. Table 3 presents results on the Mixtral MoE-type model, comparing against QuIP# at 2 bits. (See Appendix E.1 for full results.) AQLM outperforms QuIP# in this case as well. Although the margins are lower compared to LLAMA 2 models, they are still significant for "harder" tasks, such as Arc Challenge (+3 points).
Table 3: Evaluation of quantized Mixtral (Jiang et al., 2024) models for 2 bits. The table reports perplexity on WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020), as well as accuracy for zero-shot tasks. The Average accuracy column is the mean of 5 zero-shot task accuracies. Primary metrics are Wiki2 (PPL), C4 (PPL) and Average accuracy.

Size | Method | Avg bits | Wiki2↓ | C4↓ | WinoGrande↑ | PiQA↑ | HellaSwag↑ | ArcE↑ | ArcC↑ | Average accuracy↑
8x7B | – | 16 | 3.46 | 5.02 | 75.45 | 82.37 | 64.65 | 83.38 | 55.80 | 72.33
8x7B | AQLM | 1.98 | 4.61 | 5.75 | 73.64 | 79.27 | 57.91 | 78.96 | 48.63 | 67.68
8x7B | QuIP# | 2.01 | 4.75 | 5.89 | 71.11 | 79.05 | 58.23 | 77.57 | 45.73 | 66.34

Pareto optimality of AQLM. The significant error improvements raise the question of choosing the "optimal" model variant to maximize accuracy within a certain memory budget. For this, we follow Dettmers & Zettlemoyer (2022): a quantized model is said to be Pareto-optimal if it maximizes accuracy at the same or lower total size (bytes). Despite rapid progress, prior art methods are not Pareto-optimal at 2 bits: for instance, the previous best 2-bit LLAMA 2 13B (QuIP#, Table 1) achieves a Wiki2 PPL of 6.06, but one can get a much lower 5.21 PPL by using a 7B model with 4-bit quantization, which is smaller (see Appendix Table 10).

AQLM compression to strictly 2 bits for the same model is also below Pareto-optimality, as it is outperformed by 4-bit AQLM compression of LLAMA 2 7B (5.21 vs 5.65). To find the Pareto-optimal quantization bitwidth, we run experiments between 2-3 bits per parameter and report them in Table 1, below the horizontal bars. Thus, the Pareto-optimal bitwidth for AQLM appears to be around 2.5 bits per parameter (Table 1), at which point we are comparable to 5-bit AQLM for LLAMA 2 7B (Appendix Table 10). In turn, the 2.76-bit AQLM on 13B outperforms the uncompressed 7B model. As such, AQLM is the first algorithm to achieve Pareto-optimality at less than 3 bits per parameter.

4.2. Ablation analysis

In Appendix D, we examine key design choices regarding initialization, alternating optimization, the impact of the fine-tuning protocol, the distribution of codebooks vs. groups, as well as other hyper-parameters. In brief, we first find that the residual K-means initialization is critical for fast algorithm convergence: compared with random initialization, it needs significantly fewer training iterations. Second, to validate our calibration fine-tuning procedure, we compare it against 1) no fine-tuning, 2) fine-tuning only of non-linear layers (e.g. RMSNorm) but not of codebook parameters, and 3) fine-tuning only the codebooks (but not other layers). The results, presented in full in Appendix D, show that fine-tuning the codebook parameters has the highest impact on accuracy, by far, while fine-tuning the RMSNorm only has a minor impact. This validates our choice of leveraging the calibration set for learned codebooks.

Further, we observe that increasing the number of sample sequences in the range 128 to 4096 leads to a gradual PPL improvement, but with diminishing returns. In this respect, AQLM benefits more from larger calibration sets (similarly to QuIP#), as opposed to direct methods like GPTQ which saturate accuracy at around 256 input sequences. Finally, we investigate various options for investing a given bit budget, comparing e.g. longer codes (e.g. 1x15) vs. multiple codebooks with shorter codes (e.g. 2x8).

4.3. Inference Speed

Although our primary objective is to maximize accuracy for a given model size, AQLM can also be practical in terms of inference latency. To demonstrate this, we implemented efficient GPU and CPU kernels for a few hardware-friendly configurations of AQLM. The results can be found in Table 4. For GPU inference, we targeted quantized LLAMA 2 models with 16-bit codebooks, corresponding to 2.07 bits for LLAMA 2 70B, 2.18 bits for 13B, and 2.29 bits for 7B models (see Table 1), as well as a 2x8-bit codebook model with perplexity 7.98 on Wiki2. For each model, we benchmark the performance of the matrix-vector multiplication subroutine on a standard layer. The results show that AQLM can execute at speeds comparable to or better than FP16. End-to-end generation numbers on a preliminary HuggingFace integration can be found in Appendix H: for instance, we can achieve ∼6 tokens/s on LLAMA 2 70B in this setting. We observe that multiple smaller codebooks allow efficient GPU cache utilization, leading to greater speedup, at the price of slightly lower accuracy.

Table 4: Speed of the FP16 gate_proj layer matrix-vector multiplication in PyTorch, and relative AQLM speedups.

Llama 2 | 7B | 13B | 70B
2-bit speedup over FP16 on an Nvidia RTX 3090 GPU
Original (float16) | 129 µs | 190 µs | 578 µs
AQLM (Table 1) | x1.31 | x1.20 | x1.20
AQLM (2×8-bit) | x1.57 | x1.82 | x3.05
2-bit speedup over FP32 on an Intel i9 CPU, 8 cores
Original (float32) | 1.83 ms | 3.12 ms | 11.31 ms
AQLM (2×8-bit) | x2.75 | x3.54 | x3.69
AQLM (4×8-bit) | x2.55 | x3.02 | x4.07
AQLM (8×8-bit) | x2.29 | x2.68 | x4.03
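As a rough illustration of how such layer-wise numbers can be measured, here is a small PyTorch timing harness for the FP16 matrix-vector baseline (the AQLM kernels themselves are custom CUDA/C++ code and are not reproduced here); the layer sizes correspond to a LLAMA 2 7B gate_proj layer and are an assumption on our part.

```python
import time
import torch

@torch.no_grad()
def benchmark_matvec(d_out=11008, d_in=4096, iters=1000, device="cuda"):
    """Time a float16 gate_proj-style matrix-vector product, as in Table 4."""
    W = torch.randn(d_out, d_in, dtype=torch.float16, device=device)
    x = torch.randn(d_in, 1, dtype=torch.float16, device=device)
    for _ in range(10):              # warm-up
        W @ x
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        W @ x
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e6   # microseconds per call

print(f"{benchmark_matvec():.1f} us")
```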
Next, we explore how to leverage AQLM to accelerate CPU inference. As discussed in Section 2.2, additive quantization can compute dot products efficiently if the codebook size is small. One way to achieve this for AQLM is to replace each 16-bit codebook with a number of smaller 8-bit ones. This leads to higher quantization error, but still outperforms the baselines in terms of accuracy (see Appendix Table 9). The results in Table 4 show that this also allows for up to 4x faster inference relative to FP32 on CPU.
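To illustrate the lookup idea for small (8-bit) codebooks, here is a simplified PyTorch sketch of an AQLM-style matrix-vector product that first precomputes, for every codebook, the dot product of the input with every codeword, then sums table lookups; this is a conceptual sketch of the approach, not the optimized CPU kernel.

```python
import torch

def aqlm_lut_matvec(codes, codebooks, scales, x):
    """Approximate W @ x via per-codebook lookup tables of codeword-input dot products.

    codes:     (d_out, n_groups, M) integer indices
    codebooks: (M, K, g) codewords over groups of g input features
    scales:    (d_out,) per-output-unit scales
    x:         (d_in,) input with d_in = n_groups * g
    """
    M, K, g = codebooks.shape
    d_out, n_groups, _ = codes.shape
    x_groups = x.view(n_groups, g)
    # lut[m, k, n] = <codeword k of codebook m, input group n>; only M*K*n_groups dot products.
    lut = torch.einsum("mkg,ng->mkn", codebooks, x_groups)
    m_idx = torch.arange(M).view(1, 1, M)
    n_idx = torch.arange(n_groups).view(1, n_groups, 1)
    # Each output is a sum of table lookups over its groups and codebooks.
    partial = lut[m_idx, codes, n_idx]                  # (d_out, n_groups, M)
    return partial.sum(dim=(1, 2)) * scales
```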
5. Conclusion and Future Work

We presented AQLM, a new form of additive quantization (AQ) targeted at LLM compression, which significantly improves the state-of-the-art results for LLM quantization in the regime of 2 and 3 bits per weight. In terms of limitations, AQLM is more computationally expensive than direct post-training quantization methods, such as RTN or GPTQ, specifically because of the use of a more complex coding representation. Yet, despite the more sophisticated encoding and decoding, we have shown that AQLM lends itself to efficient implementation on both CPU and GPU. Overall, we find it remarkable that, using AQLM, massive LLMs can be executed accurately and efficiently using little memory.

6. Acknowledgements

The authors would like to thank Ruslan Svirschevski for his help in solving technical issues with AQLM and baselines. We also thank Tim Dettmers for helpful discussions on the structure of weights in modern LLMs and size-accuracy trade-offs. The authors would also like to thank Daniil Pavlov for his assistance with CPU benchmarking. Finally, the authors would like to thank the communities of ML enthusiasts known as LocalLLaMA³ and the Petals community on Discord⁴ for the crowd wisdom about running LLMs on consumer devices.

³ https://www.reddit.com/r/LocalLLaMA/
⁴ https://github.com/bigscience-workshop/petals/

References

Babenko, A. and Lempitsky, V. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 931–938, 2014.

Besag, J. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society Series B: Statistical Methodology, 48(3):259–279, 1986.

Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023.

Blalock, D. and Guttag, J. Multiplying matrices without multiplying. In International Conference on Machine Learning, pp. 992–1004. PMLR, 2021.

Burton, D., Shore, J., and Buck, J. A generalization of isolated word recognition using vector quantization. In ICASSP '83. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 8, pp. 1021–1024, 1983. doi: 10.1109/ICASSP.1983.1171915.

Chee, J., Cai, Y., Kuleshov, V., and Sa, C. D. QuIP: 2-bit quantization of large language models with guarantees, 2023.

Chen, S., Wang, W., and Pan, S. J. Deep neural network quantization via layer-wise optimization using limited training data. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3329–3336, Jul. 2019. doi: 10.1609/aaai.v33i01.33013329. URL https://ojs.aaai.org/index.php/AAAI/article/view/4206.

Chen, Y., Guan, T., and Wang, C. Approximate nearest neighbor search by residual vector quantization. Sensors, 10(12):11259–11273, 2010.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

Computer, T. RedPajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.

Dettmers, T. and Zettlemoyer, L. The case for 4-bit precision: k-bit inference scaling laws. arXiv preprint arXiv:2212.09720, 2022.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023a.

Dettmers, T., Svirschevski, R., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. arXiv preprint arXiv:2306.03078, 2023b.
Fernández-Marqués, J., AbouElhamayed, A. F., Lane, N. D., and Abdelfattah, M. S. Are we there yet? Product quantization and its hardware acceleration. ArXiv, abs/2305.18334, 2023. URL https://api.semanticscholar.org/CorpusID:258967539.

Frantar, E. and Alistarh, D. QMoE: Practical sub-1-bit compression of trillion-parameter models. arXiv preprint arXiv:2310.16795, 2023.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022a.

Frantar, E., Singh, S. P., and Alistarh, D. Optimal Brain Compression: A framework for accurate post-training quantization and pruning. arXiv preprint arXiv:2208.11580, 2022b. Accepted to NeurIPS 2022, to appear.

Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N., Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628.

Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):744–755, 2013.

Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and Keutzer, K. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630, 2021.

Gray, R. Vector quantization. IEEE ASSP Magazine, 1(2):4–29, 1984. doi: 10.1109/MASSP.1984.1162229.

Guo, R., Kumar, S., Choromanski, K., and Simcha, D. Quantization based fast inner product search. In Artificial Intelligence and Statistics, pp. 482–490. PMLR, 2016.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network, 2015.

Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2010.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. SqueezeLLM: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

Kurtic, E., Kuznedelev, D., Frantar, E., Goin, M., and Alistarh, D. Sparse fine-tuning for inference acceleration of large language models, 2023.

Li, Z., Ni, B., Zhang, W., Yang, X., and Gao, W. Performance guaranteed network acceleration via high-order residual quantization, 2017.

Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.

Martinez, J., Clement, J., Hoos, H. H., and Little, J. J. Revisiting additive quantization. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, pp. 137–153. Springer, 2016.

Martinez, J., Zakhmi, S., Hoos, H. H., and Little, J. J. LSQ++: Lower running time and higher recall in multi-codebook quantization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 491–506, 2018.

McCarter, C. and Dronen, N. Look-ups are not (yet) all you need for deep learning inference. ArXiv, abs/2207.05808, 2022. URL https://api.semanticscholar.org/CorpusID:250491319.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Nagel, M., Amjad, R. A., Van Baalen, M., Louizos, C., and Blankevoort, T. Up or down? Adaptive rounding for post-training quantization. In International Conference on Machine Learning (ICML), 2020.

Norouzi, M. and Fleet, D. J. Cartesian k-means. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3017–3024, 2013.
Ozan, E. C., Kiranyaz, S., and Gabbouj, M. Competitive quantization for approximate nearest neighbor search. IEEE Transactions on Knowledge and Data Engineering, 28(11):2884–2894, 2016. doi: 10.1109/TKDE.2016.2597834.

Park, G., Park, B., Kwon, S. J., Kim, B., Lee, Y., and Lee, D. nuQmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557, 2022.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Conference on Neural Information Processing Systems (NeurIPS), 2019.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: an adversarial Winograd schema challenge at scale. Commun. ACM, 64(9):99–106, 2021. doi: 10.1145/3474381. URL https://doi.org/10.1145/3474381.

Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

Sun, S., Cheng, Y., Gan, Z., and Liu, J. Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355, 2019.

Tata, S. and Patel, J. M. PiQA: An algebra for querying protein data sets. In International Conference on Scientific and Statistical Database Management, 2003.

TII UAE. The Falcon family of large language models. https://huggingface.co/tiiuae/falcon-40b, May 2023.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and Sa, C. D. QuIP#: QuIP with lattice codebooks, 2023.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.

Xiao, G., Lin, J., Seznec, M., Demouth, J., and Han, S. SmoothQuant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2022.

Yao, Z., Aminabadi, R. Y., Zhang, M., Wu, X., Li, C., and He, Y. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861, 2022.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL 2019), pp. 4791–4800, 2019. doi: 10.18653/v1/p19-1472. URL https://doi.org/10.18653/v1/p19-1472.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Zhang, T., Du, C., and Wang, J. Composite quantization for approximate nearest neighbor search. In International Conference on Machine Learning, pp. 838–846. PMLR, 2014.

Zhou, S.-C., Wang, Y.-Z., Wen, H., He, Q.-Y., and Zou, Y.-H. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology, 32(4):667–682, Jul 2017. ISSN 1860-4749. doi: 10.1007/s11390-017-1750-y. URL https://doi.org/10.1007/s11390-017-1750-y.

A. Full model fine-tuning


The block-wise finetuning procedure introduced in Section 3.4 considerably improves the performance of compressed models. However, block-wise finetuning optimizes the loss only at the level of the current transformer block and is agnostic of the actual task of interest. To minimize the target loss, one can run backpropagation through the whole model and optimize all trainable parameters to directly optimize the end objective.

This allows searching for globally optimal parameters, as opposed to the sequentially selected ones obtained during block-wise finetuning. One can minimize the error between the quantized model and the floating-point model on some calibration set. The parameters being optimized (namely the codebooks, scales and the non-quantized parameters) typically constitute a small fraction of the total number of parameters in the original model. Therefore, the proposed distillation method is an instance of parameter-efficient finetuning.
Standard knowledge distillation (Hinton et al., 2015) optimizes the KL-divergence between the outputs of the teacher and student models. However, according to the literature (Sun et al., 2019; Kurtic et al., 2023), transferring intermediate representations facilitates faster convergence and enhances performance. Hence, we decided to perform knowledge transfer between some intermediate activations in addition to the model output. Specifically, we take a subset of transformer blocks {l_0, l_1, ..., l_{K−1}}, l_i ∈ [0, L−1], where L is the number of transformer blocks in the model, and optimize the following objective:

Σ_{i=0}^{K−1} L(x_{l_i}^{(s)}, x_{l_i}^{(t)})   (9)

where x_{l_i}^{(s)} and x_{l_i}^{(t)} are the student and teacher output activations, respectively, at block l_i, and L(·, ·) is some loss function. We considered the following options (a small code sketch of these losses follows the list):

• L2 loss: ||x − y||²₂
• Normalized L2 loss: ||x − y||²₂ / ||y||²₂
• Cosine distance: ½ (1 − ⟨x, y⟩ / (||x|| ||y||))
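A small PyTorch sketch of these three loss options (our own illustrative helpers, operating on activation tensors):

```python
import torch

def l2_loss(x, y):
    return (x - y).pow(2).sum()

def normalized_l2_loss(x, y):
    return (x - y).pow(2).sum() / y.pow(2).sum()

def cosine_distance(x, y):
    x, y = x.flatten(), y.flatten()
    return 0.5 * (1 - torch.dot(x, y) / (x.norm() * y.norm()))
```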

We tune all models for a single epoch on the same calibration set used for additive quantization and block finetuning. We observe that longer training leads to overfitting. A learning rate of η = 1e−5 and a batch size of 32 with the Adam optimizer are used in the full-model finetuning experiments. We find that the simple L2 loss outperforms the alternatives (Table 5), and that the number of intermediate activations used in the loss function barely affects the final performance (Table 6).

Table 5: Ablation on the choice of loss for knowledge distillation.

Loss | Wiki2↓ | C4↓
L2 | 6.57 | 8.48
Normalized L2 | 6.63 | 8.57
Cosine distance | 6.65 | 8.59

Table 6: Ablation on the choice of block subset for knowledge distillation. The empty set means no knowledge distillation is applied.

Blocks | Wiki2↓ | C4↓
{} | 6.64 | 8.56
{31} | 6.52 | 8.51
{15, 31} | 6.57 | 8.48
{7, 15, 23, 31} | 6.55 | 8.49
{3, 7, 11, 15, 19, 23, 27, 31} | 6.57 | 8.54

Overall, the improvement due to knowledge distillation is not very significant, while it increases the complexity of the quantization pipeline; therefore, we decided not to include KD in our main setup.


B. Code reproducibility
We share the code for our method in the GitHub repository https://github.com/vahe1994/AQLM. The hyperparameters for our experimental setup are discussed in Appendix C.

C. Experimental Configurations
Hardware. In all of our experiments, we used either Nvidia A100 or H100 GPUs. The number of GPUs varied from 1 to 8. We used activation offloading to lower peak memory usage. To evaluate inference speed, we used a consumer-grade Nvidia RTX 3090 GPU for the GPU setup, and an Intel Core i9-13900K for the CPU setup.
Calibration set. All methods were calibrated on a slice of the RedPajama-v1 dataset (Computer, 2023) for both the Llama and Mistral/Mixtral family models. We used the same context length the models were trained on: 4096 for Llama-2 and 8192 for Mistral/Mixtral.

For the Llama-2 experiments, we used 8M tokens as a calibration set for SpQR, GPTQ, and AQLM. QuIP, however, was calibrated on 4M tokens due to out-of-memory errors when trying to use more samples. Since the improvement from using more than 2M tokens is fairly small, we chose to report these numbers as is. For QuIP#, we used the quantized Llama-2 and Mistral models provided by the authors in their GitHub repository. To the best of our knowledge, they used 6k samples for calibration with a context length of 4096/8192.

For Mixtral, we calibrated both our method and QuIP# on 8M tokens with a context length of 8192.
Hyperparameters.

For GPTQ, for both 3 and 4 bits we used the standard set of parameters, without grouping and with the act_order permutation order.

SpQR was evaluated with base bit-widths of 2 and 3, a group size of 16, and 3 bits for zeros and scales. The outlier rate was chosen such that the average bitwidth would be close to 3 and 4 bits, respectively.

QuIP was adapted to work on the Llama family and was calibrated with 1024 samples and a 4096 context length.

QuIP#: for Llama-2 and Mistral models we used the officially published quantized models. For Mixtral, we adapted the code to work with the model's architecture and quantized it with the recommended set of parameters. For both AQLM and QuIP#, we do not quantize the gate linear layer in Mixtral, because it contains a relatively small number of parameters while having a severe impact on performance.

AQLM: to get 2 bits, we used 1 codebook of size 2^15 or 2^16 with groups of 8. For 3 bits, we used 2 codebooks of size 2^12 with groups of 8. Finally, for 4 bits, we used 2 codebooks of size 2^15 or 2^16 with groups of 8.
Both for finetuning 3.4 and codebooks update 3.3 we used Adam optimizer (Kingma & Ba, 2015) with learning rate of
10−4 , β1 = 0.90 and β2 = 0.95. We used early stopping both for the finetuning phase and for the codebook optimization
phase, by stopping when the least square error not decreasing more than some threshold. In our experiments the threshold
varies between 10−2 and 10−3 .
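For illustration, the early-stopping rule can be written as a relative-improvement check on the least-squares error; the sketch below is a simplified outline under this assumption, with step_fn and compute_mse standing in for the actual per-layer update and objective.

```python
def optimize_with_early_stopping(step_fn, compute_mse, max_steps=10_000, rel_tol=1e-3):
    """Run optimization steps until the least-squares error stops improving.

    step_fn: performs one Adam update of the learnable parameters (hypothetical callback).
    compute_mse: returns the current least-squares reconstruction error.
    rel_tol: relative-improvement threshold (we vary it between 1e-2 and 1e-3).
    """
    best = compute_mse()
    for _ in range(max_steps):
        step_fn()
        current = compute_mse()
        # Stop once the relative improvement falls below the threshold.
        if (best - current) / best < rel_tol:
            break
        best = min(best, current)
    return best
```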

D. Ablation analysis
The AQLM algorithm makes several design choices that need to be validated separately: initialization, alternating optimization, the fine-tuning protocol, and the choice of hyperparameters. Here, we study how each of these components affects the results.
Initialization. As discussed in Section 3, we initialize AQLM with residual K-means to obtain a good initial guess for both codes and codebooks. That is, we run K-means on the weight matrix, then subtract the nearest cluster center from each weight, and run K-means again, repeating this M times. A simple baseline would be to initialize all codes uniformly at random. We compare the two initialization strategies on the problem of quantizing a single linear layer of the LLAMA 2 70B model to 3 bits per parameter. We quantize groups of 8 consecutive weights using 2 codebooks of 12 bits each; each codebook thus contains 2^12 learnable values. As we can see in Figure 4, AQLM with K-means initialization needs significantly fewer training iterations to achieve the desired loss. The difference is so drastic that we expect running AQLM with random initialization would require extremely long runtimes to accurately quantize the largest models.
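For concreteness, a sketch of this residual K-means initialization is given below, using scikit-learn's KMeans for brevity; the actual implementation operates on GPU tensors and may differ in details.

```python
import numpy as np
from sklearn.cluster import KMeans

def residual_kmeans_init(weight_groups, num_codebooks, codebook_bits):
    """weight_groups: [num_groups, group_size] array of weight groups (e.g. g = 8).

    Returns per-codebook centroids and codes; each codebook has 2**codebook_bits entries.
    """
    residual = np.asarray(weight_groups, dtype=np.float32).copy()
    codebooks, codes = [], []
    for _ in range(num_codebooks):
        km = KMeans(n_clusters=2 ** codebook_bits, n_init=1).fit(residual)
        codebooks.append(km.cluster_centers_)         # [2**B, group_size]
        codes.append(km.labels_)                      # [num_groups]
        residual -= km.cluster_centers_[km.labels_]   # subtract nearest centroid, repeat
    return codebooks, codes
```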
Figure 4: MSE loss learning curves of AQLM trained on the self-attention q_proj linear layer of the 10th block of the LLAMA 2 70B model, for random vs. K-means initialization (MSE vs. training steps, log scale).

Fine-tuning. Next, we validate the fine-tuning procedure. We compare full block fine-tuning (the default) against three
alternatives: i) no fine-tuning at all, ii) fine-tuning only non-linear layers (i.e. RMSNorm), but not the AQ parameters, and
iii) fine-tuning only the AQ parameters, but not the non-linear layers. Table 7 summarizes our results: fine-tuning the entire
model or only AQ parameters achieves competitive performance, while training only RMSNorm scales is comparable to not
fine-tuning at all. We attribute these observations to the fact that over 99% of quantized layer parameters are contained in
AQ codebooks Cm , whereas the remaining parameters are small 1-dimensional tensors. This validates the use of the AQ
approach, as many competing algorithms do not have learnable per-layer codebooks. Notably, QuIP# uses a shared fixed
lattice instead. We also note that, even without fine-tuning, AQLM is competitive to previous state-of-the-art results.

Table 7: Ablation analysis of AQLM with different fine-tuning restrictions on the Llama-2 7B model at 2.02-bit width.

Fine-tuned parameters    Wiki2↓   C4↓
None (no fine-tuning)    8.18     10.59
RMSNorm only             8.31     10.46
AQ params only           6.92     8.85
Full                     6.93     8.84

Number of samples. We verify our choice of calibration hyperparameters. Traditionally, most PTQ algorithms use several hundred calibration sequences (e.g., Frantar et al. (2022a) use 128). In our experiments, we evaluate both AQLM and the baselines with additional calibration data; our original motivation was to avoid potential overfitting when fine-tuning entire transformer blocks. To test this assumption, we run our algorithm with calibration set sizes varying from 128 to 4096 sequences. For each size, we report the average perplexity on WikiText2 over 3 runs, along with standard deviations. The results in Table 8 demonstrate that increasing the number of samples leads to a gradual reduction in perplexity with seemingly diminishing returns. Since perplexity still improves monotonically from 128 to 4096 samples, it is possible that larger sample sizes would yield further improvements.
Number of codebooks vs. groups. Finally, we conducted an additional set of experiments on the Llama-2 7B model to study how perplexity depends on simultaneously changing both the number of codebooks and the group size while keeping the compression rate fixed at 2 bits; the results are reported in Table 9. Increasing both the number of codebooks and the group size decreases the average perplexity, with improvements saturating once the group size reaches 64.

Table 10: Evaluation of quantized LLAMA 2 models at 4+ bits per parameter. The table reports perplexity on WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020), as well as accuracy on zero-shot tasks. The Average accuracy column is the mean of the 5 zero-shot task accuracies. Primary metrics are Wiki2 (PPL), C4 (PPL), and Average accuracy.

Size  Method  Avg bits  Wiki2↓  C4↓  WinoGrande↑  PiQA↑  HellaSwag↑  ArcE↑  ArcC↑  Average accuracy↑

7B
–      16    5.12  6.63  67.25  78.45  56.69  69.32  40.02  62.35
AQLM   4.04  5.21  6.75  67.32  78.24  55.99  70.16  41.04  62.55
GPTQ   4.00  5.49  7.20  68.19  76.61  55.44  66.20  36.77  60.64
SpQR   3.98  5.28  6.87  66.93  78.35  56.10  69.11  39.68  62.17
QuIP#  4.02  5.29  6.86  66.85  77.91  55.78  68.06  39.68  61.66
AQLM   5.02  5.16  6.68  67.40  78.29  56.53  68.94  39.93  62.22

13B
–      16    4.57  6.05  69.61  78.73  59.72  73.27  45.56  65.38
AQLM   3.94  4.65  6.14  69.85  78.35  59.27  73.32  44.80  65.12
GPTQ   4.00  4.78  6.34  70.01  77.75  58.67  70.45  42.49  63.87
SpQR   3.98  4.69  6.20  69.69  78.45  59.25  71.21  44.52  64.42
QuIP   4.00  4.76  6.29  69.69  79.00  58.91  73.27  44.88  65.15
QuIP#  4.01  4.68  6.20  69.38  77.91  58.86  73.74  44.63  64.90

70B
–      16    3.12  4.97  76.95  81.07  63.99  77.74  51.11  70.17
AQLM   4.14  3.19  5.03  76.48  81.50  63.69  77.31  50.68  69.93
GPTQ   4.00  3.35  5.15  75.61  81.23  63.47  76.81  49.15  69.25
SpQR   3.97  3.25  5.07  76.01  81.28  63.71  77.36  49.15  69.50
QuIP   4.00  3.58  5.38  76.01  80.25  61.97  74.28  47.01  67.90
QuIP#  4.01  3.22  5.05  76.80  81.45  63.51  78.37  50.85  70.20
AQLM   3.82  3.21  5.03  76.32  80.90  63.69  77.61  50.34  69.77

Table 8: WikiText2 PPL as a function of calibration set size for Llama 2 (7B) quantized to 2.3 bits with AQLM, averaged over 3 runs. SD stands for the adjusted standard deviation.

# of samples   Average PPL   SD
128            6.994         0.127
256            6.584         0.031
512            6.455         0.005
1024           6.353         0.008
2048           6.297         0.018
4096           6.267         0.005

Table 9: WikiText2 PPL as a function of group size and number of codebooks for Llama 2 (7B) quantized to approximately 2 bits.

# of codebooks   Group size   Average PPL
1                4            41.79
2                8            11.96
4                16           8.53
8                32           7.83
15               64           7.86

E. Additional experiments
In this section, we report additional experimental results for Mixtral (Jiang et al., 2024), Mistral 7B (Jiang et al., 2023), and Llama-2 models.

E.1. Mixtral
We report results for the Mixtral (Jiang et al., 2024) MoE model at 3 and 4 bits in Table 11. In the 4-bit case, the performance of QuIP# and AQLM is very similar across all metrics and close to that of the uncompressed FP16 model.

Table 11: Evaluation of quantized Mixtral (Jiang et al., 2024) models for 3 and 4 bits per parameter. The table reports
perplexity on WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020), as well as accuracy for zero-shot tasks. The
Average accuracy column is the mean of 5 zero-shot task accuracies. Primary metrics are Wiki2 (PPL), C4 (PPL) and
Average accuracy.

Size   Method  Avg bits  Wiki2↓  C4↓   WinoGrande↑  PiQA↑  HellaSwag↑  ArcE↑  ArcC↑  Average accuracy↑

3-bit
–      16.00   3.46   5.02  75.45  82.37  64.65  83.38  55.80  72.33
AQLM   3.02    3.80   5.18  74.74  81.77  63.22  82.28  52.30  70.86

4-bit
–      16.00   3.46   5.02  75.45  82.37  64.65  83.38  55.80  72.33
AQLM   3.915   3.57   5.07  74.82  81.99  64.23  83.12  54.61  71.75
QuIP#  4.000   3.60   5.08  76.56  81.99  63.92  82.62  54.78  71.97

Table 13: Evaluation of quantized Mistral7B (Jiang et al., 2023) models for 2, 3 and 4 bits per parameter: perplexity on
WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020), as well as accuracy for zero-shot tasks. The Average accuracy
column is the mean of 5 zero-shot task accuracies. Primary metrics are Wiki2 (PPL), C4 (PPL) and Average accuracy.

Size   Method  Avg bits  Wiki2↓  C4↓   WinoGrande↑  PiQA↑  HellaSwag↑  ArcE↑  ArcC↑  Average accuracy↑

2-bit
–      16.00  4.77  5.71  73.64  80.47  61.15  78.87  49.23  68.67
AQLM   2.01   6.32  6.93  68.75  76.01  52.13  73.65  40.44  62.17
QuIP#  2.01   6.02  6.84  69.30  76.71  52.95  72.14  39.76  62.20

3-bit
–      16.00  4.77  5.71  73.64  80.47  61.15  78.87  49.23  68.67
AQLM   3.04   5.07  5.97  72.69  80.14  59.31  77.61  46.67  67.28

4-bit
–      16.00  4.77  5.71  73.64  80.47  61.15  78.87  49.23  68.67
AQLM   4.02   4.89  5.81  73.80  79.71  60.27  77.86  48.21  67.97
QuIP#  4.01   4.85  5.79  73.95  80.41  60.62  78.96  49.40  68.67

E.2. Llama-2
We show results for 4-bit quantization of the LLAMA 2 models in Table 10. AQLM outperforms the other methods in terms of perplexity and achieves the best or close-to-best results overall. We also report perplexity results for our quantized 2×8-codebook models in Table 12.

Table 12: WikiText2 and C4 PPL for quantized Llama 2 models with 2×8 codebooks.

Model   Wiki2↓   C4↓
7B      7.98     10.38
70B     4.83     6.66

E.3. Mistral
Finally, we evaluate AQLM and QuIP# quantization of the Mistral 7B (Jiang et al., 2023) model at 2, 3, and 4 bits in Table 13. At 2 bits, QuIP# slightly outperforms AQLM on most benchmarks, while in the 4-bit setup the results are very close across the board.

F. Pareto optimality
We visualize the WikiText2 perplexity of Llama-2 7B, 13B, and 70B models quantized with AQLM and QuIP#, plotted against the quantized weight size in bytes, in Figure 5. Our method outperforms QuIP# in terms of WikiText2 perplexity across all model sizes.
Additionally, in Figure 6, we show WikiText2 perplexity for AQLM against the size of the quantized parameters. Starting at around 3.7 GiB of quantized weights, which corresponds to 2.5-bit compression of the Llama-2 13B model, it becomes more advantageous to compress the 13B model rather than the 7B model at the same model size in bytes.

Figure 5: Comparison of AQLM relative to QuIP# on LLAMA 2 7B, 13B, and 70B models (WikiText2 perplexity vs. model size in GiB).

Figure 6: Model optimality for AQLM on LLAMA 2 7B, 13B, and 70B models (WikiText2 perplexity vs. model size in GiB).

G. Estimating model size

In this section, we describe how to estimate the size of a quantized model for a given codebook configuration. The total cost of storing a quantized weight matrix comprises the codebooks, the codes, and the per-unit scales. Specifically, for a weight matrix with input dimension $d_{in}$, output dimension $d_{out}$, group size $g$, and $M$ codebooks with $B$-bit codes, the total amount of memory required is as follows (assuming that codebooks and scales are stored in half precision):
• codebooks: $16 \cdot g \cdot M \cdot 2^B$
• codes: $d_{out} \cdot (d_{in}/g) \cdot M \cdot B$
• scales: $16 \cdot d_{out}$

Therefore, the average number of bits per parameter is

$$\bar{b} = \frac{\text{size in bits}}{\text{number of parameters}} = \frac{16\, g\, M\, 2^B + d_{out}\,(d_{in}/g)\, M\, B + 16\, d_{out}}{d_{out}\, d_{in}} \qquad (10)$$
For example, for the mlp.gate_proj layer of the LLAMA 2 70B model with $d_{in} = 8192$ and $d_{out} = 28672$, quantized with group size 8 and two 8-bit codebooks, the formula above yields 2.002 bits per parameter. Typically, the storage cost is dominated by the codes, whereas the codebooks and scales induce only a small memory overhead.
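As a sanity check, Equation (10) can be evaluated directly; the short standalone snippet below reproduces the example above.

```python
def average_bits(d_in, d_out, group_size, num_codebooks, code_bits):
    """Average bits per parameter for one quantized linear layer (Eq. 10)."""
    codebook_bits = 16 * group_size * num_codebooks * 2 ** code_bits      # fp16 codebooks
    code_bits_total = d_out * (d_in // group_size) * num_codebooks * code_bits
    scale_bits = 16 * d_out                                               # fp16 per-unit scales
    return (codebook_bits + code_bits_total + scale_bits) / (d_out * d_in)

# mlp.gate_proj of LLAMA 2 70B: two 8-bit codebooks, groups of 8
print(average_bits(d_in=8192, d_out=28672, group_size=8, num_codebooks=2, code_bits=8))
# -> approximately 2.002 bits per parameter
```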

H. End-to-End Inference Speed

For quantized Llama-2 models, using the same setup as in Section 4.3, we measure the time it takes to generate 128 tokens from scratch with batch size 1, and report the average number of generated tokens per second on a single 24GB RTX 3090 GPU, as well as on an Intel i9 CPU, in Table 14.

Table 14: Text generation speed benchmark.

Llama 2                                      7B      13B     70B

Inference on Nvidia RTX 3090 GPU, tok/s
Original (float16)                           41.51   26.76   5.66
AQLM (Table 1)                               32.22   25.04   5.65
AQLM (2×8-bit)                               32.64   25.35   6.03

Inference on Intel i9 CPU, 8 cores, tok/s
Original (float32)                           3.106   1.596   0.297
AQLM (2×8-bit)                               6.961   4.180   0.966
AQLM (4×8-bit)                               6.837   4.004   0.948
AQLM (8×8-bit)                               5.319   3.193   0.775

I. Codebook and codes distribution


The proposed AQLM quantization method allows considerable freedom in the choice of quantization lattice and in the weight distributions it can represent. To understand what the learned codes and codebooks look like, we visualize the distribution of codes (i.e., how frequently each codebook vector is chosen) and the learned codebooks. Figure 7 shows a histogram of the learned codes and the two leading principal components of the codebook for a specific layer. One can observe that codes are assigned roughly uniformly across codebook vectors, and that the codebook vectors are concentrated within a ball. This pattern holds for all linear projections inside transformer blocks.


Figure 7: Visualization of the learned codes and codebook for the layers.31.mlp.down_proj linear projection. (Left) Code distribution: count of assignments per codebook id. (Right) Two leading principal components of the codebook.
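The plots in Figure 7 can be reproduced with standard tooling; below is a sketch under the assumption that the quantized layer exposes its code assignments and codebook as arrays named codes and codebook (names are illustrative).

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_codes_and_codebook(codes, codebook):
    """codes: integer array of selected codebook ids; codebook: [2**B, group_size] array."""
    fig, (ax_hist, ax_pca) = plt.subplots(1, 2, figsize=(10, 4))

    # Left: how frequently each codebook vector is chosen
    ax_hist.hist(codes.ravel(), bins=64)
    ax_hist.set_xlabel("Codebook id")
    ax_hist.set_ylabel("Count")

    # Right: two leading principal components of the learned codebook vectors
    proj = PCA(n_components=2).fit_transform(codebook)
    ax_pca.scatter(proj[:, 0], proj[:, 1], s=2)
    ax_pca.set_xlabel("PCA dim 1")
    ax_pca.set_ylabel("PCA dim 2")
    plt.show()
```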
