Machine Learning Systems With Reduced Memory Requirements
Franklin Huang
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission.
Spring 2024
The Research Project of Hongyi (Franklin) Huang, titled Machine Learning Systems with
Reduced Memory Requirements, is approved:
Copyright 2024
by
Hongyi (Franklin) Huang
Abstract
by Hongyi (Franklin) Huang
Machine learning systems today are developing in two opposite directions, both of which require increasing hardware awareness to make productization feasible. On one hand, stable scaling laws in large language models (LLMs) push system scale ever larger; on the other hand, robotics and wearable applications require neural networks to fit into small systems with extremely tight compute, memory, and power budgets.
Even though Moore's Law and architectural innovations have been sustaining compute performance growth, there is a widening processor-memory gap that requires imminent innovation. While anticipated hardware advancements such as GDDR7, HBM4, and UCIe are expected to alleviate this gap, challenges in the memory hierarchy will persist. Therefore, it is crucial to design kernels that enhance inference throughput by efficiently utilizing and managing the memory hierarchy.
This technical report explores improvements in the compression of a broad range of language
models and compares them to the state-of-the-art. While quantization benefits all models,
this thesis finds that sparsity and entropy methods are particularly effective for smaller
models, reducing the bitrate to as low as 1.96 bits per weight with minimal accuracy loss.
In contrast, larger language models derive greater advantages from enhanced data reuse and
page-based memory management techniques.
Specifically, in CodeGen applications where parallel sampling enhances accuracy, these strategies have demonstrated the potential to reduce the memory capacity and bandwidth requirements of attention kernels by 15x. When evaluating problem-solving capacity, parallel sampling effectively matches the capabilities of a single sampled larger model with a tenfold reduction in memory and parameter count. These achievements unlock the possibility of local deployment for both real-time embedded systems and language model applications.
And to those who share my passion for making technology more human.
Contents

1  On-Device TinyML
   1.1  Quantization Basics
   1.2  Memory Traffic & Runtime Analysis
   1.3  Single Batch Small Language Model Inference
2  Entropy-Based Lossless Compression
3  CodeGen: Parallelism in Language Models
4  Conclusion
Bibliography
Acknowledgments
I would like to acknowledge the many people who have continuously assisted me during the short year it took to complete this thesis; without them, none of this would have happened. First and foremost, my research advisor Professor Nikolic, my second reader Professor Shao, and PhD candidate Coleman Hooper, who helped scope achievable topics. Additionally, the staff and professors of CS 252A, EECS 251B, and EE C249A helped me discover the complexities of how algorithms interact with every level of hardware. Most importantly, the tapeout and bringup members of BearlyML22 & 23 designed a homegrown chip that can run a small language model for the first time. Also, Gert Christen and the members of TensorZipper, who found problems and challenges in building a startup on embedded-system machine learning, even though we eventually realized the product would not be very scalable and pivoted. Finally, all members of the SLICE and BWRC labs, on whom we relied for consultation on Chipyard and PCB analog magic. Special shoutout to Yufeng Chi, Edison Wang, and Richard Yan for the 'Emeryville Satellite Campus' of SLICE.
Chapter 1
On-Device TinyML
TinyML fits small machine learning systems into mobile or even embedded devices for on-device, real-time inference. These are often highly quantized models with negligible training costs that nevertheless need to run within tight deadlines or power budgets. The primary difficulty of these systems lies in writing kernels that compute efficiently while minimizing off-chip bandwidth. This chapter discusses basic quantization methods in ML and how more advanced optimizations can achieve speedups in mobile and embedded systems.
Table 1.1: Common fixed-point storage formats for quantized numbers.

Format   Layout (sign:int:frac)   Sign bits   Integer bits   Fraction bits
int16    1:7:8                    1           7              8
int8     1:3:4                    1           3              4
int4     1:3:0                    1           3              0
Z = [max(X) − min(X)]/2
Q(x, s, z) = saturate(x/s + z, A, B)
Note that the scaling factor and zero point here are determined for the entire matrix or tensor as a whole. This only works for small models; larger models, as explained in Chapter 3, require group quantization (a separate scale per group of roughly 64 weights) to deal with outliers.
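For concreteness, the following is a minimal NumPy sketch of this per-tensor scheme. Rounding is made explicit, and the int4 range and zero-point choice in the example are illustrative assumptions rather than values taken from the original kernels.

import numpy as np

def quantize(x: np.ndarray, s: float, z: float, A: int = -8, B: int = 7) -> np.ndarray:
    """Q(x, s, z) = saturate(x/s + z, A, B); defaults A = -8, B = 7 correspond to signed int4."""
    return np.clip(np.round(x / s) + z, A, B).astype(np.int8)

def dequantize(q: np.ndarray, s: float, z: float) -> np.ndarray:
    """Approximate inverse: x ~= (q - z) * s."""
    return (q.astype(np.float32) - z) * s

# One scale and zero point for the whole tensor, as in this chapter.
x = np.random.randn(8, 8).astype(np.float32)
s = (x.max() - x.min()) / 15.0            # 15 = B - A quantization levels for int4
z = float(np.round(-x.min() / s)) - 8.0   # illustrative zero point, not from the text
q = quantize(x, s, z)
print(np.abs(x - dequantize(q, s, z)).mean())   # average reconstruction error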
The same quantization-aware training paper [1] finds that M can be expressed as a simple fixed-point multiply and shift:

$$M = \frac{S_W \, S_{A_l}}{S_{A_{l+1}}} = 2^{-n} M_0 = M_0 \gg n$$

This insight allows for a kernel fusion that reduces the dot product all the way to the next layer's activations in one pass over memory, using nothing but fixed-point multipliers, shifters, and adders, which are commonly available and efficient in any ISA or architecture.
$$A_{l+1}^{(i,k)} = \mathrm{ReLU6}\!\left(\left[\, M_0 \sum_{j=1}^{N} W_l^{(i,j)} A_l^{(j,k)} \,\right] \gg n\right)$$
Note that since everything is now in fixed point, using the ReLU6 activation function in the model (a ReLU clipped at a ceiling of 6) prevents activation quantization error caused by overflow and reduces accuracy degradation when porting to a fully quantized kernel. ReLU6 has also been noted to be helpful in previous fixed-point quantization work [2], and experimentally, int4 and int8 weights achieve < 1% accuracy loss on MNIST.
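As an illustration of the fused integer pipeline above, the NumPy sketch below assumes int8 inputs and a precomputed (M_0, n) pair; the function name, shapes, and the q6 parameter (the quantized representation of 6 in the next layer's scale) are assumptions for the example, not the thesis's actual kernel.

import numpy as np

def fused_int_layer(A_l: np.ndarray, W_l: np.ndarray, M0: int, n: int, q6: int = 127) -> np.ndarray:
    """Integer matmul, rescale by (x * M0) >> n, then quantized ReLU6 (clip to [0, q6]).

    A_l: int8 activations of shape (N, K); W_l: int8 weights of shape (M, N).
    M0 and n encode M = S_W * S_Al / S_Al+1 = 2^-n * M0 as a multiply plus right shift.
    q6 = 127 assumes 6.0 maps to the int8 saturation value in the next layer's scale.
    """
    acc = W_l.astype(np.int32) @ A_l.astype(np.int32)    # 32-bit accumulation of int8 products
    rescaled = (acc.astype(np.int64) * int(M0)) >> n     # fixed-point requantization
    return np.clip(rescaled, 0, q6).astype(np.int8)      # ReLU6 in the quantized domain

The whole layer stays in integer arithmetic, so a single streaming pass over the int8 weights is sufficient.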
The system can sustain 7 GFLOPS when coded optimally, but all of this depends on a tiny QSPI RAM and flash for memory access. While switching to convolutions would make the problem compute-bound, many models based on attention or diffusion still rely heavily on feed-forward layers. Hence the motivation for Section 1.3 and Chapter 2 is to find a way to reduce the memory bandwidth incurred by reading weights.
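To make the memory-bound nature concrete, a rough single-batch bound: every weight byte must be streamed once per generated token, so bandwidth alone caps throughput regardless of compute. The numbers below (130 KB of int4 weights for a 260k-parameter model, ~40 MB/s QSPI) are illustrative assumptions, not measurements from this report.

def tokens_per_s_upper_bound(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Memory-bound ceiling for single-batch decoding: one full pass over the weights per token."""
    return bandwidth_bytes_per_s / weight_bytes

# Illustrative: 260k parameters at int4 ~= 130 KB, streamed over an assumed ~40 MB/s QSPI link.
print(tokens_per_s_upper_bound(130e3, 40e6))   # ~300 tokens/s ceiling, independent of compute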
(a) BearlyML 2022 Cycle Benchmarks. (b) Typical Embedded System Architecture.
• Speculative decoding to increase batch size by using a small model to look ahead and
guess.
• Activation sparsity induction by using ReLU activations in the perceptron layers, which
requires network fine-tuning.
Because a draft model only predicts outputs similar to the target model when parameter counts are sufficiently large, speculative decoding is a poor fit here. Instead, we explore activation sparsity induction, which can effectively double throughput by skipping weight reads whenever the corresponding activations are zero.
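A scalar reference sketch of the idea follows (not the actual BearlyML kernel): in a single-batch matrix-vector product, any weight column whose activation is zero never needs to be fetched from off-chip memory.

import numpy as np

def sparse_matvec(W: np.ndarray, a: np.ndarray) -> np.ndarray:
    """y = W @ a, but weight columns are only read where a[j] != 0.

    When W streams in from QSPI flash/RAM, skipping a column skips its memory traffic
    entirely, so an activation density d reads roughly d * |W| bytes.
    """
    y = np.zeros(W.shape[0], dtype=np.float32)
    nz = np.flatnonzero(a)        # indices of non-zero (post-ReLU) activations
    for j in nz:                  # only these columns are loaded from memory
        y += W[:, j] * a[j]
    return y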
Figure 1.1: Left: Sparse compute kernel. Right: Relufied Llama architecture.
From experiments fine-tuning the Llama 2 7B model and retraining/fine-tuning a 260k-parameter Llama model, the following non-trivial drawbacks were observed:
– Convergence stagnates within 30-100 iterations when using LoRA [8]; only full-rank training works, which is extremely memory intensive. Possible recent parameter-efficient fine-tuning mitigations include ReLoRA [9] or GaLore [10].
– As batch size increases, aggregate sparsity across the batch decreases, eventually leading to reading in ≈ 73% of weights at B = 64 (see Table 1.3). The only use of a sparsity-induced fine-tuned model is therefore single-batch inference, which may not justify the fine-tuning cost.
– Fine-tuning the base model to induce sparsity requires the recipe for the original training dataset, and the fine-tuning data must be drawn from that distribution to avoid knowledge collapse. There is also no known procedure for selecting a dataset when the model being fine-tuned is not a base model but itself a fine-tune of a base model, such as CodeLlama.
• For the Llama 260k TinyStories model, the activation density percentages are shown in Tables 1.2 and 1.3:
Table 1.2: Non-zero activation density (%) and loss by model and activation function.

Model Type    Llama 260k (TinyStories)             ReLU Strikes Back Llama 7B    OPT 220k
Act Type      relu     relu(x-.25)   relu(x-1)     relu          relu(x-1)       relu
D Proj        46%      23%           0%            35%           3%              32.15%
QKV           60%      27%           25%           49%           ?               49.77%
FFN           63%      32%           12%           33%           ?               52.01%
Val Loss      2.155    2.079         2.383         -             -               2.083
Train Loss    2.158    2.080         2.382         -             -               2.085
Table 1.3: Non-zero activation % across consecutive tokens for llama 260k (TinyStories).
Chapter 2

Entropy-Based Lossless Compression
Experiments show that a lossless compression algorithm can be applied to these small models to reduce storage by up to 51% relative to int4, achieving an effective int2 bitrate. There are caveats, though, including:
• Lossless compression is only useful for small models and is harder to utilize for large models, which have occasional outliers.
2.1 Overview
Quantization is a lossy way to compress bits by rounding; its average error can be measured as avg(abs(X − Q(X))). Lossless compression, in comparison, takes advantage of the uneven statistical distribution of quantized symbols to further reduce the bits per symbol, and its theoretical limit is the entropy of the symbols [11]. Given symbols s ∈ S, the average bits per symbol of a distribution is

$$\sum_{s \in S} P(s) \log_2 \frac{1}{P(s)},$$

where P(s) is the probability of symbol occurrence.
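As a sanity check, this Shannon bound can be measured directly on a quantized tensor; the small NumPy sketch below is not tied to any particular coder, and the Laplace-distributed example is purely illustrative.

import numpy as np

def bits_per_symbol(q: np.ndarray) -> float:
    """Empirical entropy sum_s P(s) * log2(1/P(s)) of a quantized tensor, in bits/symbol."""
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Example: int4 weights concentrated near zero compress well below 4 bits/weight.
q = np.clip(np.round(np.random.laplace(scale=1.0, size=100_000)), -8, 7).astype(np.int8)
print(bits_per_symbol(q))   # a peaked distribution typically lands around 2-3 bits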
Huffman coding [12], Arithmetic Coding (AC) [13], and Asymmetric Numeral Systems (ANS) [14] are three such lossless compression algorithms; they essentially use more bits to represent rare symbols and fewer bits to represent frequent symbols, achieving compression in aggregate. The more skewed the distribution, the better the compression rate. Huffman coding is widely used in a variety of compression algorithms, as it is simple to decompress and is highly effective when probabilities are extremely skewed. AC and ANS both achieve a near-Shannon-optimal compression ratio by packing fractional bits of information into a streaming range of numbers, thereby improving over Huffman. AC, however, is computationally intense and slow beyond binary symbols, while tANS, when converted to a simple lookup table, achieves fractional bit packing at a throughput similar to that of Huffman.
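To make the Huffman case concrete, the following is a minimal, textbook construction of Huffman code lengths (not the thesis's implementation); frequent symbols receive short codes and rare symbols long ones.

import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for the empirical distribution of `symbols`."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (subtree weight, tie-breaker, {symbol: depth within subtree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}   # merging pushes symbols one level deeper
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Skewed toy distribution of quantized weight values.
weights = [0] * 70 + [1] * 15 + [-1] * 10 + [2] * 3 + [-2] * 2
lengths = huffman_code_lengths(weights)
avg_bits = sum(lengths[s] for s in weights) / len(weights)
print(lengths, avg_bits)   # the average code length sits slightly above the entropy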
Note that even though entropy-based compression algorithms are lossless by themselves, they are always applied after quantization to further squeeze the bitrate of the compressed objects. Hence, entropy-based methods are not useful on their own but are used in conjunction with quantization to push the accuracy-bitrate tradeoff to a near-Shannon-optimal level.
must occur for lookup tables to output more symbols at once. Section 2.5 outlines a more serious problem that limits entropy methods to only minor improvements on current large language models unless quantization-aware training methods are used.
Table 2.2: Single-core speed benchmark and overhead relative to reading in quantized bits
from Flash or DRAM. On M1 silicon, a state-of-the-art 1.5-1.8 Giga int4 Symbols/s is
achieved per core. Alternatives including branchless programming and CPU vectorization
were attempted but did not yield speedups.
Chapter 3

CodeGen: Parallelism in Language Models
3.1 Introduction
In this chapter, we evaluate CodeLlama [20] using the Mostly Basic Python Programming (MBPP) dataset [21] and introduce novel algorithms that optimize trade-offs between accuracy, runtime, and device constraints. To the best of my knowledge, this is the first study to explore the effects of PagedAttention [22] on CodeGen. In this application, parallelism can be substantial, and branching unfolds gradually rather than abruptly. Through page management of the KV-cache, an economy of scale is achieved where parallel sampling by a factor of 100 is, on average, only 6.6 times more resource-intensive than single sampling on the MBPP dataset. Combined with data from CodeLlama showing that a 100x-sampled 7B model achieves the problem-solving performance of a single-sampled 70B model, we have effectively matched the problem-solving capabilities of a larger model with a tenfold reduction in memory requirements. This achievement opens up the potential for local deployment.
# Write a function to find the shared elements from the given two lists.
# Test cases:
assert set(similar_elements((3, 4, 5, 6), (5, 7, 4, 10))) == set((4, 5))
assert set(similar_elements((1, 2, 3, 4), (5, 4, 3, 7))) == set((3, 4))
assert set(similar_elements((11, 12, 14, 13), (17, 15, 14, 13))) == set((13, 14))

# Solution
def similar_elements(test_tup1, test_tup2):
    res = tuple(set(test_tup1) & set(test_tup2))
    return (res)
1. Softmax within the attention layer causes occasional outliers in both activations and
weights, significantly increasing the dynamic range of the signals.
2. Querying past information from KV-Cache in the attention mechanism incurs off-chip
memory traffic, which does not scale efficiently with batch size during inference when
the histories of tokens are independent.
We start by addressing how to quantize large tensors with outliers through group quantization in Section 3.3. Sections 3.4 and 3.5 then explore methods to reduce the memory traffic of weights and the KV-cache footprint in the attention mechanism when the batch size is large, by exploiting application-specific parallelism.
PyTorch causes excessive traffic because it writes the intermediate shift and scale results to DRAM, making two passes over the data. A simple custom kernel in Triton fuses the operators into one pass. Note that quantization is always the slower direction, as finding the max for the scale takes extra effort; for PyTorch this is especially so, since it incurs an extra pass through DRAM.
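For reference, below is a minimal Triton sketch of a one-pass kernel in this spirit: it keeps the per-group shift and scale of dequantization in registers so each element is read once and written once. The group size, names, and signature are illustrative assumptions, not the open-sourced kernels.

import torch
import triton
import triton.language as tl

@triton.jit
def fused_group_dequant_kernel(q_ptr, scale_ptr, zero_ptr, out_ptr, n_elements,
                               GROUP: tl.constexpr, BLOCK: tl.constexpr):
    # Each program instance handles BLOCK contiguous int8 codes; scale/zero are per group.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    q = tl.load(q_ptr + offs, mask=mask, other=0).to(tl.float32)
    g = offs // GROUP                                   # group index of each element
    s = tl.load(scale_ptr + g, mask=mask, other=1.0)
    z = tl.load(zero_ptr + g, mask=mask, other=0.0)
    # Shift and scale stay in registers: one read of q, one write of the result.
    tl.store(out_ptr + offs, ((q - z) * s).to(tl.float16), mask=mask)

def fused_group_dequant(q, scale, zero, group: int = 64, block: int = 1024):
    # q: int8 codes; scale/zero: one entry per group of `group` elements (assumed layout).
    out = torch.empty(q.shape, dtype=torch.float16, device=q.device)
    n = q.numel()
    grid = (triton.cdiv(n, block),)
    fused_group_dequant_kernel[grid](q, scale, zero, out, n, GROUP=group, BLOCK=block)
    return out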
Figure 3.2: Custom fused kernel performances compared to naive PyTorch implementation,
benchmarked on RTX 4080.
Figure 3.3: KV-cache and inference batch size, comparing naive, shared prompt, and paging.
However, the attention mechanism traditionally scales poorly even with large batches. Because it must recall its own context, as demonstrated in Figure 3.4, the KV-cache memory footprint and traffic usually scale linearly with batch size. By analyzing the dependency graph for this specific application, we find that a page-based attention kernel can effectively unlock parallelism and sequence-dependent memory optimizations, allowing shared contexts to be reused and amortized. See Figure 3.3 for a visual example.
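A toy calculation illustrates the economy of scale from sharing prompt pages across parallel samples; the page size and the per-branch handling of the partial prompt page are simplifying assumptions, not vLLM's exact bookkeeping.

import math

PAGE_TOKENS = 16  # tokens per KV-cache page (illustrative)

def naive_pages(prompt_len: int, gen_lens) -> int:
    """Each sequence stores its full context independently."""
    return sum(math.ceil((prompt_len + g) / PAGE_TOKENS) for g in gen_lens)

def paged_pages(prompt_len: int, gen_lens) -> int:
    """Fully shared prompt pages are stored once; only divergent suffixes get private pages."""
    shared = prompt_len // PAGE_TOKENS          # pages covered entirely by the shared prompt
    tail = prompt_len % PAGE_TOKENS             # the partial page is duplicated per branch here
    private = sum(math.ceil((tail + g) / PAGE_TOKENS) for g in gen_lens)
    return shared + private

# Example: 100 parallel samples, 512-token prompt, 256 generated tokens each.
gens = [256] * 100
print(naive_pages(512, gens), paged_pages(512, gens))   # 4800 vs 1632 pages in this toy setting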
Figure 3.4: Weights can achieve memory reuse through large batches, but the KV-cache traditionally does not.
$$\mathrm{softmax}(y_i) = \frac{e^{y_i/T}}{\sum_{j=1}^{n} e^{y_j/T}}$$
(a) Effective Tokens/s without Particles (b) Effective Tokens/s with Particles
Figure 3.5: Blue: Tokens/s produced, Orange: Tokens/s used in the program (eliminating
special tokens and trajectories that ended early).
spent already. In Figure 3.6, higher temperatures show a deteriorating pass@1 rate. It is hence reasonable to reduce the temperature when there are few particles left on a path, forcing convergence to a good solution within a limited budget.
A simple linear interpolation between (1, T_min) and (B_max, T_max) effectively maintains sample diversity while not spending excessive time converging on a good solution. In Figure 3.6, the runs with annealing consistently show on-par pass rates while spending less time; here T_min = 0.1 and the originally given temperature becomes T_max.
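A minimal sketch of such a linear schedule follows; the function and variable names are illustrative, and the defaults mirror the T_min = 0.1 choice above.

def anneal_temperature(active_particles: int, B_max: int,
                       T_min: float = 0.1, T_max: float = 1.0) -> float:
    """Linearly interpolate between (1, T_min) and (B_max, T_max).

    With many particles still active, sample near T_max for diversity; as the budget on a
    path dwindles toward a single particle, cool toward T_min so it converges greedily.
    """
    b = max(1, min(active_particles, B_max))
    frac = (b - 1) / (B_max - 1) if B_max > 1 else 0.0
    return T_min + frac * (T_max - T_min)

# Example: 100 particles left -> T_max; 1 particle left -> near-greedy T_min.
print(anneal_temperature(100, 100), anneal_temperature(10, 100), anneal_temperature(1, 100))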
Figure 3.6: Llama-7B on MBPP, 0-shot prompt pass rates @1, @10, @100 with varying temperatures, top-k=32. The annealing schedule is described above. For data see Tables 3.1 and 3.2.
Since only 20% of the batches are active on average, only 0.7 GB is needed, which is 20% of the 3.5 GB of weights. This allows us to run CodeLlama 7B with a max prompt length of 512 tokens, a max code length of 256 tokens, and a paged batch size of 100 with ease on a 2080 Ti with only 11 GB of VRAM.
Temperature   Pass@1   Pass@10   Pass@100   Pass@10 Anneal   Pass@100 Anneal
0.1           7.50     8.15      8.77       -                -
0.4           7.75     9.67      14.07      -                -
0.7           8.23     12.48     21.22      10.57            16.86
1.0           7.58     12.79     27.72      10.78            23.23
1.3           -        14.95     37.09      11.18            21.12
1.6           -        14.24     63.09      11.59            21.05
Figure 3.7: Average batch size over time; on average, the memory and compute reduction equals the area above the orange line up to y = 100.
• FlashAttention [27], which uses the online softmax mechanism to fuse the softmax kernel with the matrix multiply in one pass.
• ChunkAttention [28], illustrated in Figure 3.5.1, which explicitly allows the shared prompt to gain an extra degree of parallelism and fully saturate tensor core throughput.
• A novel TreeAttention kernel, illustrated in Figure 3.5.1, which saves DRAM traffic for fine-grained branching of the KV-cache by ensuring repeated elements stay resident in the GPU's L2 cache.
In the synthetic benchmarks shown in Figure 3.9, we find that ChunkAttention indeed achieves a speedup proportional to the percentage of the context that is shared prompt, here 50%. Furthermore, TreeAttention can reduce DRAM traffic by a further 15.5% when a branching factor of 1 occurs at every time step. For modern GPUs whose compute throughput far outpaces memory bandwidth, we observe that a TreeAttention-only kernel can achieve a speedup over a ChunkAttention-only kernel without even needing to explicitly parallelize the shared prompt.
On the MBPP dataset specifically, we were not able to integrate these kernels in time, but based on the prompt-to-solution length ratio of 2:1, the synthetic benchmarks suggest on average at least a threefold acceleration of the attention kernel. This is left for future work. Cumulatively, the 5x average reduction in batch size through particle sampling and the 3x reduction in KV-cache footprint through page management reduce memory footprint and bandwidth by 15x relative to the original KV-cache size, making scaling by parallel batches more economical.
¹ Kernels are open-sourced at https://github.com/hongyihuang/spec-mcts/blob/main/triton_kernels.py
Figure 3.8: ChunkAttention (blue) with shared prompt only, compared against FlashAttention (green).
Figure 3.9: ChunkAttention (with non-shared KV-Cache), tested with half shared-prompt
and half divergent-context. TreeAttention is tested with an extra branch of 1 per time step
along with half shared-prompt.
(a) Chunk attention kernel illustration. (b) Tree attention kernel illustration.
Chapter 4
Conclusion
In summary, this thesis contributes the use of entropy-based compression and activation sparsity for small language models that fit on embedded systems, each of which can potentially double inference performance compared to simple int4 quantization. We demonstrate that asymmetric numeral systems can near-optimally compress the weights of models with limited dynamic range at minimal overhead, achieving a bitrate of about 2 bits per weight. Additionally, activation sparsity can be induced by inserting ReLU activations during training, and skipping weight reads for sparse activations can reduce off-chip bandwidth by roughly 50%.
In contrast, large language models have weights with a large dynamic range. Parallel sampling, paging-based memory management, and novel attention kernels for specific applications such as code generation achieve performance equal to a tenfold-larger model with a 15x smaller KV-cache footprint. This thesis contributes by characterizing the effects of paging-based memory management on CodeGen tasks specifically and by adding a novel tree attention kernel that fuses the attention-layer arithmetic and optimally shares KV-cache traffic.
These methods enable effective local or distributed deployment in both embedded systems and language model applications that are real-time or privacy-critical. Most importantly, for models of any scale, writing fused kernels that avoid extra memory passes between operations is demonstrated to be the most reliable way to obtain speedups. Compilers that can automatically fuse rudimentary operators without manual programming may prove useful.
There are still many future research avenues for these three methods. For activation sparsity, solving the various challenges of scaling up the sparse model remains an open problem, as discussed in Section 1.3.2. While entropy-based compression is only useful for small models or non-transformer models, a speedup can be realized when a hardware decompression unit is implemented to minimize resource consumption. We also await the open-source release of BitNet or ternary LLMs for further experimentation with entropy-based compression. Finally, for large language models in applications that benefit from parallel sampling, more sophisticated planning and search resembling AlphaGo-style tree search may further improve performance against larger models.
Bibliography
[1] Benoit Jacob et al. "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference". In: CoRR abs/1712.05877 (2017). arXiv: 1712.05877. url: http://arxiv.org/abs/1712.05877.
[2] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient in-
ference: A whitepaper. 2018. arXiv: 1806.08342 [cs.LG].
[3] Hao Wu et al. Integer Quantization for Deep Learning Inference: Principles and Em-
pirical Evaluation. 2020. arXiv: 2004.09602 [cs.LG].
[4] Keivan Alizadeh et al. LLM in a flash: Efficient Large Language Model Inference with
Limited Memory. 2024. arXiv: 2312.11514 [cs.CL].
[5] Iman Mirzadeh et al. ReLU Strikes Back: Exploiting Activation Sparsity in Large Lan-
guage Models. 2023. arXiv: 2310.04564 [cs.LG].
[6] Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.
arXiv: 2307.09288 [cs.CL].
[7] Ronen Eldan and Yuanzhi Li. TinyStories: How Small Can Language Models Be and
Still Speak Coherent English? 2023. arXiv: 2305.07759 [cs.CL].
[8] Edward J. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. 2021.
arXiv: 2106.09685 [cs.CL].
[9] Vladislav Lialin et al. ReLoRA: High-Rank Training Through Low-Rank Updates. 2023.
arXiv: 2307.05695 [cs.CL].
[10] Jiawei Zhao et al. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank
Projection. 2024. arXiv: 2403.03507 [cs.LG].
[11] C. E. Shannon. "A mathematical theory of communication". In: The Bell System Technical Journal 27.3 (1948), pp. 379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x.
[12] David A. Huffman. "A Method for the Construction of Minimum-Redundancy Codes". In: Proceedings of the IRE 40.9 (1952), pp. 1098–1101. doi: 10.1109/JRPROC.1952.273898.
[13] G. G. Langdon. “An Introduction to Arithmetic Coding”. In: IBM Journal of Research
and Development 28.2 (1984), pp. 135–149. doi: 10.1147/rd.282.0135.
[14] Jarek Duda. Asymmetric numeral systems: entropy coding combining speed of Huffman
coding with compression rate of arithmetic coding. 2014. arXiv: 1311.2540 [cs.IT].
[15] Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep
Neural Networks with Pruning, Trained Quantization and Huffman Coding. 2016. arXiv:
1510.00149 [cs.CV].
[16] Sehoon Kim et al. SqueezeLLM: Dense-and-Sparse Quantization. 2024. arXiv: 2306.07629 [cs.CL].
[17] Coleman Hooper et al. KVQuant: Towards 10 Million Context Length LLM Inference
with KV Cache Quantization. 2024. arXiv: 2401.18079 [cs.LG].
[18] Shuming Ma et al. The Era of 1-bit LLMs: All Large Language Models are in 1.58
Bits. 2024. arXiv: 2402.17764 [cs.CL].
[19] Elias Frantar and Dan Alistarh. QMoE: Practical Sub-1-Bit Compression of Trillion-
Parameter Models. 2023. arXiv: 2310.16795 [cs.LG].
[20] Baptiste Rozière et al. “Code Llama: Open Foundation Models for Code”. In: Meta
AI (2023).
[21] Jacob Austin et al. “Program Synthesis with Large Language Models”. In: arXiv
preprint arXiv:2108.07732 (2021).
[22] Woosuk Kwon et al. Efficient Memory Management for Large Language Model Serving
with PagedAttention. 2023. arXiv: 2309.06180 [cs.LG].
[23] Reiner Pope et al. Efficiently Scaling Transformer Inference. 2022. arXiv: 2211.05102 [cs.LG].
[24] Ashish Vaswani et al. Attention Is All You Need. 2023. arXiv: 1706.03762 [cs.CL].
[25] Bita Darvish Rouhani et al. Microscaling Data Formats for Deep Learning. 2023. arXiv:
2310.10537 [cs.LG].
[26] Philippe Tillet, H. T. Kung, and David Cox. "Triton: an intermediate language and compiler for tiled neural network computations". In: Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2019). Phoenix, AZ, USA: Association for Computing Machinery, 2019, pp. 10–19. isbn: 9781450367196. doi: 10.1145/3315508.3329973. url: https://doi.org/10.1145/3315508.3329973.
[27] Tri Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-
Awareness. 2022. arXiv: 2205.14135 [cs.LG].
[28] Lu Ye et al. ChunkAttention: Efficient Attention on KV Cache with Chunking Sharing
and Batching. 2024. url: https://openreview.net/forum?id=9k27IITeAZ.