Efficient Deep Learning

We make a distinction between functionality and efficiency. Functionality research manipulates the mathematical functions models implement (mappings from input to output), and hence the functionality they provide. Efficiency research focuses on implementing the same (or a very similar) function while using fewer resources.

For each linked paper we provide a spoiler summarising its efficiency gains. Details of how these gains are achieved are left to the papers themselves.


Transformers

Optimized Kernels

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness "FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K)."
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning "These yield around 2× speedup compared to FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72% model FLOPs utilization)."
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision "FlashAttention-2 [achieves] only 35% utilization on the H100 GPU. [...] FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0× with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6× lower numerical error than a baseline FP8 attention."
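
These kernels are increasingly exposed through standard libraries. As a rough illustration (not tied to any one of the papers above), PyTorch's `F.scaled_dot_product_attention` can dispatch to a fused FlashAttention-style kernel on supported GPUs; the shapes and dtypes below are illustrative assumptions.

```python
# Minimal sketch: fused attention via PyTorch SDPA, which can dispatch to a
# FlashAttention-style kernel on supported GPUs.  Shapes/dtypes are illustrative.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
v = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)

# Fused attention: when a fused kernel is selected, the full seq_len x seq_len
# score matrix is never materialised in GPU HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (batch, heads, seq_len, head_dim)
```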

PEFT Methods

LoRA: Low-Rank Adaptation of Large Language Models
  • Freeze most model parameters and adapt the model through small low-rank decomposed matrices added alongside the frozen weights

  • Around 2× faster than full-parameter training

  • Memory efficient, since Adam's first/second-moment states are kept only for the small adapter matrices

  • Saves disk space when storing fine-tuned weights, which is particularly useful for customization scenarios where many task-specific adapters share one base model
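
A minimal sketch of the LoRA idea, assuming a single frozen `nn.Linear` base layer; the class name, rank, and scaling are illustrative choices, not the paper's reference implementation.

```python
# Minimal LoRA sketch: a frozen linear layer plus a trainable low-rank update,
# y = W x + (alpha / r) * B A x.  Names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only lora_A and lora_B receive gradients (and optimizer state).
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


layer = LoRALinear(nn.Linear(512, 512))
y = layer(torch.randn(4, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)  # far fewer trainable parameters than 512 * 512
```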

QLoRA: Efficient Finetuning of Quantized LLMs
  • Quantize the frozen base-model weights to 4-bit during training, further reducing memory consumption
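
A hedged sketch of how 4-bit base weights are commonly combined with LoRA adapters, assuming the Hugging Face `transformers` + `bitsandbytes` + `peft` stack; the model name is a placeholder, not a specific checkpoint.

```python
# Sketch: load a base model with 4-bit (NF4) quantized frozen weights, then
# attach LoRA adapters on top.  "org/base-model" is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from QLoRA
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "org/base-model", quantization_config=bnb_config
)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()
```
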
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
  • Randomly sample which layers are activated (i.e. trainable) during training

  • Around 1.5× faster than LoRA

  • Even outperforms full-parameter training on instruction-following tasks
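
A rough sketch of the layerwise sampling idea in plain PyTorch, assuming the model exposes its transformer blocks as an `nn.ModuleList`; the function name and hyperparameters are illustrative (the paper additionally keeps the embeddings and output head always trainable).

```python
# Rough sketch of layerwise sampling: every few optimizer steps, unfreeze a
# small random subset of transformer blocks and freeze the rest, so gradients
# (and optimizer state, if the optimizer tracks only active parameters) exist
# for just a few layers at a time.  Names here are illustrative.
import random
import torch.nn as nn


def resample_active_layers(blocks: nn.ModuleList, n_active: int = 2) -> None:
    active = set(random.sample(range(len(blocks)), n_active))
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = i in active


# Tiny demo with stand-in "blocks"; in practice these would be the model's
# transformer layers, resampled every `interval` training steps.
blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
resample_active_layers(blocks, n_active=2)
print([all(p.requires_grad for p in b.parameters()) for b in blocks])
```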

Model Architecture Modifications

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  • We reduce the number of heads for K and V by a factor of $g$, repeating them to match the number of query/output heads, $h$.

  • With $g = 8$, we get strong performance with minimal loss in quality. This reduces the K, V parameter count by a factor of 8, and correspondingly shrinks the K/V cache that must be read at each generation step.
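
A minimal sketch of the grouped K/V head expansion, assuming tensors laid out as (batch, heads, seq, head_dim); all shapes are illustrative.

```python
# Minimal GQA sketch: K and V have h / g heads; each K/V head is shared by g
# query heads by repeating it before standard attention.
import torch
import torch.nn.functional as F

batch, seq, head_dim = 2, 128, 64
h, g = 32, 8                      # 32 query heads, reduced by g = 8
kv_heads = h // g                 # 4 K/V heads

q = torch.randn(batch, h, seq, head_dim)
k = torch.randn(batch, kv_heads, seq, head_dim)   # 8x fewer K heads to store
v = torch.randn(batch, kv_heads, seq, head_dim)   # 8x fewer V heads to store

# Repeat each K/V head g times so the head count matches the h query heads.
k = k.repeat_interleave(g, dim=1)
v = v.repeat_interleave(g, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (batch, h, seq, head_dim)
```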
