Efficient Deep Learning

We make a distinction between functionality and efficiency. Functionality research manipulates the mathematical functions models implement (mappings from input to output), and hence the functionality they provide. Efficiency research focuses on implementing the same (or a very similar) function while using fewer resources.

For each linked paper we provide a spoiler summarising its efficiency gains. Details of how these gains are achieved are left to the papers themselves.


Transformers

Optimized Kernels

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness "FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K)."
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning "These yield around 2× speedup compared to FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72% model FLOPs utilization)."
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision "FlashAttention-2 [achieves] only 35% utilization on the H100 GPU. [...] FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0× with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6× lower numerical error than a baseline FP8 attention."
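
These kernels are increasingly exposed through standard libraries. As a rough illustration (not tied to any one of the papers above), PyTorch's `F.scaled_dot_product_attention` can dispatch to a fused FlashAttention-style kernel on supported GPUs; the shapes and dtypes below are illustrative assumptions.

```python
# Minimal sketch: fused attention via PyTorch SDPA, which can dispatch to a
# FlashAttention-style kernel on supported GPUs.  Shapes/dtypes are illustrative.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
v = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)

# Fused attention: when a fused kernel is selected, the full seq_len x seq_len
# score matrix is never materialised in GPU HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (batch, heads, seq_len, head_dim)
```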

PEFT Methods

LoRA: Low-Rank Adaptation of Large Language Models
  • Freeze most model parameters and adapt the model through small low-rank decomposed matrices added alongside the frozen weights

  • Around 2× faster than full-parameter training

  • Memory efficient, since Adam's first/second-moment states are kept only for the small adapter matrices

  • Saves disk space when storing fine-tuned weights, which is particularly useful for customization scenarios where many task-specific adapters share one base model
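
A minimal sketch of the LoRA idea, assuming a single frozen `nn.Linear` base layer; the class name, rank, and scaling are illustrative choices, not the paper's reference implementation.

```python
# Minimal LoRA sketch: a frozen linear layer plus a trainable low-rank update,
# y = W x + (alpha / r) * B A x.  Names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only lora_A and lora_B receive gradients (and optimizer state).
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


layer = LoRALinear(nn.Linear(512, 512))
y = layer(torch.randn(4, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)  # far fewer trainable parameters than 512 * 512
```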

QLoRA: Efficient Finetuning of Quantized LLMs
  • Quantize the frozen base-model weights to 4-bit during training, further reducing memory consumption
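
A hedged sketch of how 4-bit base weights are commonly combined with LoRA adapters, assuming the Hugging Face `transformers` + `bitsandbytes` + `peft` stack; the model name is a placeholder, not a specific checkpoint.

```python
# Sketch: load a base model with 4-bit (NF4) quantized frozen weights, then
# attach LoRA adapters on top.  "org/base-model" is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from QLoRA
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "org/base-model", quantization_config=bnb_config
)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()
```
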
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
  • Randomly sample which layers are activated (i.e. trainable) during training

  • Around 1.5× faster than LoRA

  • Even outperforms full-parameter training on instruction-following tasks
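
A rough sketch of the layerwise sampling idea in plain PyTorch, assuming the model exposes its transformer blocks as an `nn.ModuleList`; the function name and hyperparameters are illustrative (the paper additionally keeps the embeddings and output head always trainable).

```python
# Rough sketch of layerwise sampling: every few optimizer steps, unfreeze a
# small random subset of transformer blocks and freeze the rest, so gradients
# (and optimizer state, if the optimizer tracks only active parameters) exist
# for just a few layers at a time.  Names here are illustrative.
import random
import torch.nn as nn


def resample_active_layers(blocks: nn.ModuleList, n_active: int = 2) -> None:
    active = set(random.sample(range(len(blocks)), n_active))
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = i in active


# Tiny demo with stand-in "blocks"; in practice these would be the model's
# transformer layers, resampled every `interval` training steps.
blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
resample_active_layers(blocks, n_active=2)
print([all(p.requires_grad for p in b.parameters()) for b in blocks])
```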

Model Architecture Modifications

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  • We reduce the number of heads for K and V by a factor of $g$, repeating them to match the number of query/output heads, $h$.

  • With $g = 8$, we get strong performance with minimal loss in quality. This reduces the K, V parameter count by a factor of 8, and correspondingly shrinks the K/V cache that must be read at each generation step.
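
A minimal sketch of the grouped K/V head expansion, assuming tensors laid out as (batch, heads, seq, head_dim); all shapes are illustrative.

```python
# Minimal GQA sketch: K and V have h / g heads; each K/V head is shared by g
# query heads by repeating it before standard attention.
import torch
import torch.nn.functional as F

batch, seq, head_dim = 2, 128, 64
h, g = 32, 8                      # 32 query heads, reduced by g = 8
kv_heads = h // g                 # 4 K/V heads

q = torch.randn(batch, h, seq, head_dim)
k = torch.randn(batch, kv_heads, seq, head_dim)   # 8x fewer K heads to store
v = torch.randn(batch, kv_heads, seq, head_dim)   # 8x fewer V heads to store

# Repeat each K/V head g times so the head count matches the h query heads.
k = k.repeat_interleave(g, dim=1)
v = v.repeat_interleave(g, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (batch, h, seq, head_dim)
```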
