From 57e47dca939bb97b8ef97b1a1ec80a495e2ae7b5 Mon Sep 17 00:00:00 2001 From: andrewor14 Date: Fri, 13 Jun 2025 12:59:22 -0700 Subject: [PATCH] Revamp README MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Major changes: - Added performance highlights to the top - Added Latest News and Quick Start section - Moved Integrations to the top - Added key features - Updated overall messaging to match TorchAO paper - Added more tags to look more official - Condensed and hid some sections with too much detail - General formatting fixes and visual improvements ## Before: Screenshot 2025-06-13 at 5 07 41 PM ## After: Screenshot 2025-06-13 at 5 07 21 PM --- README.md | 274 +++++++++++++++++++++++++++++------------------------- 1 file changed, 146 insertions(+), 128 deletions(-) diff --git a/README.md b/README.md index 691594a933..d269c3974e 100644 --- a/README.md +++ b/README.md @@ -1,40 +1,133 @@ -# torchao: PyTorch Architecture Optimization +
+# TorchAO + +
+
+### PyTorch-Native Training-to-Serving Model Optimization
+- Pre-train Llama-3.1-70B **1.5x faster** with float8 training
+- Recover **77% of the perplexity degradation from quantization** on Llama-3.2-3B with QAT
+- Quantize Llama-3-8B to int4 for **1.89x faster** inference with **58% less memory**
+
+ +[![](https://img.shields.io/badge/CodeML_%40_ICML-2025-blue)](https://codeml-workshop.github.io/codeml2025/) [![](https://dcbadge.vercel.app/api/server/gpumode?style=flat&label=TorchAO%20in%20GPU%20Mode)](https://discord.com/channels/1189498204333543425/1205223658021458100) +[![](https://img.shields.io/github/contributors-anon/pytorch/ao?color=yellow&style=flat-square)](https://github.com/pytorch/ao/graphs/contributors) +[![](https://img.shields.io/badge/torchao-documentation-blue?color=DE3412)](https://docs.pytorch.org/ao/stable/index.html) +[![license](https://img.shields.io/badge/license-BSD_3--Clause-lightgrey.svg)](./LICENSE) + +[Latest News](#-latest-news) | [Overview](#-overview) | [Quick Start](#-quick-start) | [Integrations](#-integrations) | [Inference](#-inference) | [Training](#-training) | [Videos](#-videos) | [Citation](#-citation) + +
+
+
+## 📣 Latest News
+
+- [Jun 25] Our [TorchAO paper](https://codeml-workshop.github.io/codeml2025/) was accepted to CodeML @ ICML 2025!
+- [Apr 25] Float8 rowwise training yielded [1.34-1.43x training speedup](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/) at 2K H200 GPU scale
+- [Apr 25] TorchAO was added as a [quantization backend to vLLM](https://docs.vllm.ai/en/latest/features/quantization/torchao.html)!
+- [Mar 25] Our [2:4 Sparsity paper](https://openreview.net/pdf?id=O5feVk7p6Y) was accepted to SLLM @ ICLR 2025!
+- [Jan 25] Our [integration with GemLite and SGLang](https://pytorch.org/blog/accelerating-llm-inference/) yielded 1.1-2x faster inference with int4 and float8 quantization across different batch sizes and tensor parallel sizes
+- [Jan 25] We added [1-8 bit ARM CPU kernels](https://pytorch.org/blog/hi-po-low-bit-operators/) for linear and embedding ops
+
+ Older news
+- [Nov 24] We achieved [1.43-1.51x faster pre-training](https://pytorch.org/blog/training-using-float8-fsdp2/) on Llama-3.1-70B and 405B using float8 training
+- [Oct 24] TorchAO was added as a quantization backend to HF Transformers!
+- [Sep 24] We officially launched TorchAO. Check out our blog [here](https://pytorch.org/blog/pytorch-native-architecture-optimization/)!
+- [Jul 24] QAT [recovered up to 96% of the accuracy degradation](https://pytorch.org/blog/quantization-aware-training/) from quantization on Llama-3-8B
+- [Jun 24] Semi-structured 2:4 sparsity [achieved 1.1x inference speedup and 1.3x training speedup](https://pytorch.org/blog/accelerating-neural-network-training/) on the SAM and ViT models respectively
+- [Jun 24] Block sparsity [achieved 1.46x training speedup](https://pytorch.org/blog/speeding-up-vits/) on the ViT model with <2% drop in accuracy
-[Introduction](#introduction) | [Inference](#inference) | [Training](#training) | [Installation](#installation) |[Composability](#composability) | [Prototype Features](#prototype-features) | [Integrations](#integrations) | [Videos](#videos) | [For Developers](#for-developers) | [License](#license) | [Citation](#citation)
+
-## Introduction
-`torchao` accelerates PyTorch models with minimal code changes through advanced quantization and sparsification techniques. Optimize weights, gradients, activations, and more for both inference and training.
+## 🌅 Overview
-From the team that brought you the fast series
+TorchAO is a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow
+for AI models. TorchAO works out-of-the-box with `torch.compile()` and `FSDP2` across most HuggingFace PyTorch models. Key features include:
+* Float8 [training](torchao/float8/README.md) and [inference](https://docs.pytorch.org/ao/main/generated/torchao.quantization.Float8DynamicActivationFloat8WeightConfig.html) for speedups without compromising accuracy
+* [MX training and inference](torchao/prototype/mx_formats/README.md) (prototype), which provides MX tensor formats based on native PyTorch MX dtypes
+* [Quantization-Aware Training (QAT)](torchao/quantization/qat/README.md) for mitigating quantization degradation
+* [Post-Training Quantization (PTQ)](torchao/quantization/README.md) for int4, int8, fp6, etc., with matching kernels targeting a variety of backends including CUDA, ARM CPU, and XNNPACK
+* [Sparsity](torchao/sparsity/README.md), including techniques such as 2:4 sparsity and block sparsity
+
+Check out our [docs](https://docs.pytorch.org/ao/main/) for more details!
+
+From the team that brought you the fast series:
 * 9.5x inference speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
 * 10x inference speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
 * 3x inference speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3)
 * 2.7x inference speedup for FAIR’s Seamless M4T-v2 model with [seamlessv2-fast](https://pytorch.org/blog/accelerating-generative-ai-4/)
-`torchao` isn't just for inference - it delivers substantial speedups at scale, from [up to 1.5x speedups](https://pytorch.org/blog/training-using-float8-fsdp2/) on 512 GPU clusters, to [1.34-1.43x speedups](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/) on 2K H200 clusters with the latest `torchao.float8` rowwise
-`torchao` works out-of-the-box with `torch.compile()` and `FSDP2` across most Hugging Face PyTorch models
+## 🚀 Quick Start
+
+First, install TorchAO. We recommend installing the latest stable version:
+```
+pip install torchao
+```
+
+ Other installation options + + ``` + # Nightly + pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126 + + # Different CUDA versions + pip install torchao --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6 + pip install torchao --index-url https://download.pytorch.org/whl/cpu # CPU only + + # For developers + USE_CUDA=1 python setup.py develop + ``` +
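+If you do not have a model handy, a small toy model is enough to follow along. The snippet below is a hypothetical example (the exact model and hardware behind the numbers in the next step are described in the quick start guide); the default int4 weight-only path generally expects bfloat16 weights on a CUDA device:
+```python
+import torch
+
+# Hypothetical toy model for illustration -- any model built from nn.Linear
+# layers, including HuggingFace models, can be quantized the same way.
+model = torch.nn.Sequential(
+    torch.nn.Linear(1024, 1024, bias=False),
+    torch.nn.Linear(1024, 1024, bias=False),
+).to(device="cuda", dtype=torch.bfloat16)
+```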
-## Inference
+Quantize your model weights to int4!
+```python
+from torchao.quantization import Int4WeightOnlyConfig, quantize_
+quantize_(model, Int4WeightOnlyConfig(group_size=32))
+```
+Compared to a `torch.compile`'d bf16 baseline, your quantized model should be significantly smaller and faster on a single A100 GPU:
+```
+int4 model size: 1.25 MB
+bfloat16 model size: 4.00 MB
+compression ratio: 3.2
+
+bf16 mean time: 30.393 ms
+int4 mean time: 4.410 ms
+speedup: 6.9x
+```
+For the full model setup and benchmark details, check out our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html). Alternatively, try quantizing your favorite model using our [HuggingFace space](https://huggingface.co/spaces/pytorch/torchao-my-repo)!
-`torchao` delivers substantial performance gains with minimal code changes:
-### Performance Highlights
+## 🔗 Integrations
-- **INT4 Weight-Only Quantization**: 2x throughput (180 vs 107 tokens/sec) with 60% less memory (6.88 GB vs 16.43 GB) on LLaMA-3-7B
-- **Float8 Dynamic Quantization**: 53.88% speedup on Flux.1-Dev* and 27.33% speedup on CogVideoX-5b on H100 GPU with preserved quality
-- **INT4 + 2:4 Sparsity**: 2.4x throughput (226 vs 95 tokens/sec) with 80% memory reduction (5.3GB vs 16.4GB) on LLaMA-3-8B
+TorchAO is integrated into some of the leading open-source libraries, including:
-[View detailed benchmarks](torchao/quantization/README.md) | [Learn about sparsity](torchao/sparsity/README.md)
+* HuggingFace transformers with a [builtin inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao) and [low bit optimizers](https://github.com/huggingface/transformers/pull/31865)
+* HuggingFace diffusers with [TorchAO quantization support](https://github.com/huggingface/diffusers/blob/main/docs/source/en/quantization/torchao.md) and best practices for combining it with `torch.compile`
+* The Mobius HQQ backend, which leveraged our int4 kernels to reach [195 tok/s on a 4090](https://github.com/mobiusml/hqq#faster-inference)
+* TorchTune for our [QLoRA](https://docs.pytorch.org/torchtune/main/tutorials/qlora_finetune.html), [QAT](https://docs.pytorch.org/torchtune/main/recipes/qat_distributed.html), and [float8 quantized fine-tuning](https://github.com/pytorch/torchtune/pull/2546) recipes
+* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
+* vLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
+* SGLang for LLM serving: [usage](https://docs.sglang.ai/backend/server_arguments.html#server-arguments) and the main integration [PR](https://github.com/sgl-project/sglang/pull/1341)
+* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
-### Getting Started with Quantization
-Quantize any model with `nn.Linear` layers (including HuggingFace models) in just one line:
+## 🔎 Inference
+
+TorchAO delivers substantial performance gains with minimal code changes:
+
+- **Int4 weight-only**: [1.89x throughput with 58.1% less memory](torchao/quantization/README.md) on Llama-3-8B
+- **Float8 dynamic quantization**: [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https://github.com/sayakpaul/diffusers-torchao) on H100 with preserved quality
+- **Int4 + 2:4 sparsity**: [2.37x throughput with 67.7% memory reduction](torchao/sparsity/README.md) on Llama-3-8B
+
+Quantize any model with `nn.Linear` layers in just one line (Option 1), or load a pre-quantized model directly from HuggingFace via our transformers integration (Option 2):

 #### Option 1: Direct TorchAO API
@@ -61,65 +154,43 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
 )
 ```
-### Deployment with vLLM
-
-Deploy quantized models with one command:
+#### Deploy quantized models in vLLM with one command:
 ```shell
 vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
-**Benefits**: 67% VRAM reduction and 12-20% speedup on A100 GPUs while maintaining quality.
-
-[Step-by-step quantization guide](https://huggingface.co/pytorch/Phi-4-mini-instruct-int4wo-hqq#quantization-recipe) | [Pre-quantized models](https://huggingface.co/pytorch)
+With this quantization flow, we achieve **67% VRAM reduction and 12-20% speedup** on A100 GPUs while maintaining model quality. For more details, see this [step-by-step quantization guide](https://huggingface.co/pytorch/Phi-4-mini-instruct-int4wo-hqq#quantization-recipe). We also release some pre-quantized models [here](https://huggingface.co/pytorch).
-## Training
+## 🚅 Training
-### Quantization Aware Training
+### Quantization-Aware Training
-Post-training quantization can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization Aware Training (QAT) to overcome this limitation. In collaboration with [Torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md#quantization-aware-training-qat), we've developed a QAT recipe that demonstrates significant accuracy improvements over traditional PTQ, recovering **96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext** for Llama3 compared to post-training quantization (PTQ). And we've provided a full recipe [here](https://pytorch.org/blog/quantization-aware-training/). For more details, please see the [QAT README](./torchao/quantization/qat/README.md).
+Post-training quantization can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization-Aware Training (QAT) to overcome this limitation, especially for lower bit-width dtypes such as int4. In collaboration with [TorchTune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md#quantization-aware-training-qat), we've developed a QAT recipe that demonstrates significant accuracy improvements over traditional post-training quantization (PTQ), recovering **96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext** for Llama3.
+For more details, please refer to the [QAT README](torchao/quantization/qat/README.md) and the [original blog](https://pytorch.org/blog/quantization-aware-training/):
 ```python
-from torchao.quantization import (
-    quantize_,
-    Int8DynamicActivationInt4WeightConfig,
-)
-from torchao.quantization.qat import (
-    FakeQuantizeConfig,
-    FromIntXQuantizationAwareTrainingConfig,
-    IntXQuantizationAwareTrainingConfig,
-)
-
-# Insert fake quantization
+import torch
+from torchao.quantization import quantize_
+from torchao.quantization.qat import FakeQuantizeConfig, IntXQuantizationAwareTrainingConfig
 activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
 weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
-quantize_(
-    my_model,
-    IntXQuantizationAwareTrainingConfig(activation_config, weight_config),
-)
+qat_config = IntXQuantizationAwareTrainingConfig(activation_config, weight_config)
+quantize_(my_model, qat_config)
+```
-# Run training... (not shown)
+Users can also combine LoRA + QAT to speed up training by [1.89x](https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700) compared to vanilla QAT using this [fine-tuning recipe](https://github.com/pytorch/torchtune/blob/main/recipes/qat_lora_finetune_distributed.py).
-# Convert fake quantization to actual quantized operations
-quantize_(my_model, FromIntXQuantizationAwareTrainingConfig())
-quantize_(my_model, Int8DynamicActivationInt4WeightConfig(group_size=32))
-```
 ### Float8
-[torchao.float8](torchao/float8) implements training recipes with the scaled float8 dtypes, as laid out in https://arxiv.org/abs/2209.05433
-
-With ``torch.compile`` on, current results show throughput speedups of up to **1.5x on up to 512 GPU / 405B parameter count scale** ([details](https://pytorch.org/blog/training-using-float8-fsdp2/))
+[torchao.float8](torchao/float8) implements training recipes with the scaled float8 dtypes, as laid out in https://arxiv.org/abs/2209.05433. With `torch.compile` on, current results show throughput speedups of up to **1.5x on up to 512 GPU / 405B parameter count scale** ([details](https://pytorch.org/blog/training-using-float8-fsdp2/)):
 ```python
 from torchao.float8 import convert_to_float8_training
-convert_to_float8_training(m, module_filter_fn=...)
+convert_to_float8_training(m)
 ```
-And for an end-to-minimal training recipe of pretraining with float8, you can check out [torchtitan](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md).
-
-#### Blog posts about float8 training
-
+Our float8 training is integrated into [TorchTitan's pre-training flows](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md) so users can easily try it out.
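+To convert only some of the layers, a `module_filter_fn` can be passed to `convert_to_float8_training`. The snippet below is a minimal sketch, assuming the filter receives each module along with its fully-qualified name and returns whether that module should be converted:
+```python
+import torch
+from torchao.float8 import convert_to_float8_training
+
+m = torch.nn.Sequential(torch.nn.Linear(2048, 2048, bias=False)).cuda()
+
+# Sketch: only convert nn.Linear layers whose dimensions are divisible by 16,
+# a common requirement of the underlying float8 matmul kernels.
+def module_filter_fn(mod: torch.nn.Module, fqn: str) -> bool:
+    return (
+        isinstance(mod, torch.nn.Linear)
+        and mod.in_features % 16 == 0
+        and mod.out_features % 16 == 0
+    )
+
+convert_to_float8_training(m, module_filter_fn=module_filter_fn)
+m = torch.compile(m)  # the speedups quoted above rely on torch.compile
+```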
+For more details, check out these blog posts about our float8 training support:
+* [Accelerating Large Scale Training and Convergence with PyTorch Float8 Rowwise on Crusoe 2K H200s](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/)
 * [Supercharging Training using float8 and FSDP2](https://pytorch.org/blog/training-using-float8-fsdp2/)
 * [Efficient Pre-training of Llama 3-like model architectures using torchtitan on Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/efficient-pre-training-of-llama-3-like-model-architectures-using-torchtitan-on-amazon-sagemaker/)
 * [Float8 in PyTorch](https://dev-discuss.pytorch.org/t/float8-in-pytorch-1-x/1815)
@@ -127,13 +198,10 @@ And for an end-to-minimal training recipe of pretraining with float8, you can ch
 ### Sparse Training
-We've added support for semi-structured 2:4 sparsity with **6% end-to-end speedups on ViT-L**. Full blog [here](https://pytorch.org/blog/accelerating-neural-network-training/)
-
-The code change is a 1 liner with the full example available [here](torchao/sparsity/training/)
+We've added support for semi-structured 2:4 sparsity with **6% end-to-end speedups on ViT-L**. Full blog [here](https://pytorch.org/blog/accelerating-neural-network-training/). The code change is a one-liner with the full example available [here](torchao/sparsity/training/):
 ```python
 from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear
-
 swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})
 ```
@@ -141,8 +209,7 @@ swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})
 Optimizers like ADAM can consume substantial GPU memory - 2x as much as the model parameters themselves. TorchAO provides two approaches to reduce this overhead:
-1. **Quantized optimizers**: Reduce optimizer state memory by 2-4x by quantizing to lower precision
-
+**1. Quantized optimizers**: Reduce optimizer state memory by 2-4x by quantizing to lower precision
 ```python
 from torchao.optim import AdamW8bit, AdamW4bit, AdamWFp8
 
 optim = AdamW8bit(model.parameters()) # replace with AdamW4bit or AdamWFp8 for the 4-bit or fp8 variants
 ```
 Our quantized optimizers are implemented in just a few hundred lines of PyTorch code and compiled for efficiency. While slightly slower than specialized kernels, they offer an excellent balance of memory savings and performance. See detailed [benchmarks here](https://github.com/pytorch/ao/tree/main/torchao/optim).
-2. **CPU offloading**: Move optimizer state and gradients to CPU memory
+**2. CPU offloading**: Move optimizer state and gradients to CPU memory
 For maximum memory savings, we support [single GPU CPU offloading](https://github.com/pytorch/ao/tree/main/torchao/optim#optimizer-cpu-offload) that efficiently moves both gradients and optimizer state to CPU memory. This approach can **reduce your VRAM requirements by 60%** with minimal impact on training speed:
@@ -159,32 +226,10 @@ optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, fused=True)
 optim.load_state_dict(ckpt["optim"])
 ```
-## Installation
-
-`torchao` makes liberal use of several new features in PyTorch, it's recommended to use it with the current nightly or latest stable version of PyTorch, see [getting started](https://pytorch.org/get-started/locally/) for more details.
- -Install the stable release (recommended): -```bash -pip install torchao -``` - -Other options: -```bash -# Nightly build -pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu124 - -# Different CUDA versions -pip install torchao --index-url https://download.pytorch.org/whl/cu118 # CUDA 11.8 -pip install torchao --index-url https://download.pytorch.org/whl/cpu # CPU only - -``` - -### Development Install -``` -USE_CPP=0 python setup.py develop # Skip C++/CUDA extensions -``` + -## License +## 🎥 Videos +* [Keynote talk at GPU MODE IRL](https://youtu.be/FH5wiwOyPX4?si=VZK22hHz25GRzBG1&t=1009) +* [Low precision dtypes at PyTorch conference](https://youtu.be/xcKwEZ77Cps?si=7BS6cXMGgYtFlnrA) +* [Slaying OOMs at the Mastering LLM's course](https://www.youtube.com/watch?v=UvRl4ansfCg) +* [Advanced Quantization at CUDA MODE](https://youtu.be/1u9xUK3G4VM?si=4JcPlw2w8chPXW8J) +* [Chip Huyen's GPU Optimization Workshop](https://www.youtube.com/live/v_q2JTIqE20?si=mf7HeZ63rS-uYpS6) +* [Cohere for AI community talk](https://www.youtube.com/watch?v=lVgrE36ZUw0) -`torchao` is released under the [BSD 3](https://github.com/pytorch-labs/ao/blob/main/LICENSE) license. -# Citation +## 💬 Citation If you find the torchao library useful, please cite it in your work as below. + ```bibtex @software{torchao, - title = {torchao: PyTorch native quantization and sparsity for training and inference}, - author = {torchao maintainers and contributors}, - url = {https://github.com/pytorch/torchao}, - license = {BSD-3-Clause}, - month = oct, - year = {2024} + title={TorchAO: PyTorch-Native Training-to-Serving Model Optimization}, + author={torchao}, + url={https://github.com/pytorch/torchao}, + license={BSD-3-Clause}, + month={oct}, + year={2024} } ```