The official implementation of SpargeAttn, a universal sparse attention method that accelerates language, image, and video models.
- `python>=3.9`, `torch>=2.3.0`
- CUDA:
  - `>=12.8` for Blackwell
  - `>=12.4` for fp8 support on Ada
  - `>=12.3` for fp8 support on Hopper
  - `>=12.0` for Ampere
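
The CUDA requirements above can be checked programmatically before building. The following is a minimal sketch, not part of SpargeAttn: the table values are copied from the list above, while the helper names `parse_version` and `cuda_is_sufficient` and the lowercase architecture keys are hypothetical and exist only for illustration.

```python
# Minimum CUDA toolkit versions, copied from the requirements list above.
MIN_CUDA = {
    "blackwell": (12, 8),
    "ada": (12, 4),     # minimum for fp8 support on Ada
    "hopper": (12, 3),  # minimum for fp8 support on Hopper
    "ampere": (12, 0),
}


def parse_version(text):
    """Turn a string such as '12.4' or '12.4.1' into a (major, minor) tuple."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def cuda_is_sufficient(arch, cuda_version):
    """Return True if cuda_version meets the minimum listed for arch."""
    key = arch.lower()
    if key not in MIN_CUDA:
        raise ValueError(f"unknown architecture: {arch!r}")
    # Tuples compare element by element, so (12, 4) >= (12, 3) holds.
    return parse_version(cuda_version) >= MIN_CUDA[key]
```

With PyTorch installed, the string in `torch.version.cuda` (the CUDA version PyTorch was built against) is one possible input for `cuda_version`; it is `None` on CPU-only builds, so that case needs separate handling.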
 
pip install ninja   # for parallel compilation
python setup.py install   # or pip install -e .

- spas_sage2_attn_meansim_cuda: SpargeAttn based on SageAttention2.
- spas_sage_attn_meansim_cuda: SpargeAttn based on SageAttention.
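
To switch between the two APIs without editing call sites, the function can be chosen by a short key. This is a minimal sketch under stated assumptions: the two function names come from the list above, but the import name `spas_sage_attn`, the keys `"sage2"` and `"sage"`, and the helpers `resolve_variant` and `load_attention` are hypothetical and not confirmed by this README.

```python
import importlib

PACKAGE = "spas_sage_attn"  # assumed import name, not confirmed by this README

# Function names as listed above, keyed by a hypothetical short label.
VARIANTS = {
    "sage2": "spas_sage2_attn_meansim_cuda",  # based on SageAttention2
    "sage": "spas_sage_attn_meansim_cuda",    # based on SageAttention
}


def resolve_variant(name="sage2"):
    """Map a short variant key to the API function name listed above."""
    if name not in VARIANTS:
        raise ValueError(
            f"unknown variant {name!r}; expected one of {sorted(VARIANTS)}"
        )
    return VARIANTS[name]


def load_attention(name="sage2"):
    """Import the package lazily and return the selected attention function."""
    module = importlib.import_module(PACKAGE)
    return getattr(module, resolve_variant(name))
```

The import is deferred to `load_attention`, so the name lookup can be validated on a machine where the CUDA extension is not built.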
Tuning:
# sequential tuning
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune
# parallel tuning; this uses all GPUs available on the machine
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune --parallel_tune

Inference:
# `--compile` is optional and slows down the first inference run.
python evaluate/cogvideo_example.py --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --compile

Note: We provide pre-tuned hyper-parameters, `CogVideoX-2b_0.06_0.07.pt`, that allow running the inference script directly. However, for better speed and quality, we recommend re-tuning, because the provided hyper-parameters were tuned with SpargeAttn based on SageAttention, whereas the default API is now based on SageAttention2.
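
When scripting several runs, the tuning and inference commands above can be assembled from one helper. This is a minimal sketch: every flag and path is copied from the commands above, while `build_command` and its parameters are hypothetical names. Rejecting flag combinations that the commands above do not show is a choice made for this sketch, not a documented restriction of the script.

```python
import shlex

SCRIPT = "evaluate/cogvideo_example.py"
CKPT = "evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt"


def build_command(mode, script=SCRIPT, ckpt=CKPT, parallel=False, compile_model=False):
    """Return the argument list for a tuning ("tune") or inference ("infer") run."""
    if mode not in ("tune", "infer"):
        raise ValueError(f"mode must be 'tune' or 'infer', got {mode!r}")
    # Only the combinations shown in the commands above are accepted.
    if parallel and mode != "tune":
        raise ValueError("--parallel_tune is only shown together with --tune")
    if compile_model and mode != "infer":
        raise ValueError("--compile is only shown for inference")

    cmd = ["python", script, "--use_spas_sage_attn", "--model_out_path", ckpt]
    if mode == "tune":
        cmd.append("--tune")
        if parallel:
            cmd.append("--parallel_tune")
    elif compile_model:
        cmd.append("--compile")
    return cmd


if __name__ == "__main__":
    # Print the parallel tuning command as a single shell-ready line.
    print(shlex.join(build_command("tune", parallel=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues if the checkpoint path contains spaces.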
Note: `--compile` is optional and further accelerates video generation, but it adds overhead to the first video generation.
The tuning and inference usage is similar to CogVideoX.
Here’s a list of the models tuned so far; see Hugging Face for all tuned checkpoints. Our approach is universal, and we warmly welcome contributions! Feel free to submit a pull request to support more models. 🚀
| model name | example script | tuned ckpt | 
|---|---|---|
| CogVideoX-2b | evaluate/cogvideo_example.py | link | 
| want2v-1.3B | evaluate/wan_example.py | link | 
| Flux | evaluate/flux_example.py | TBD | 
Note: All experiments in the table above and in our paper used SpargeAttn based on SageAttention. An updated implementation based on SageAttention2 is now available and offers a further 30% speedup.
Figures: The quality of video generation on Mochi. End-to-end performance of NIAH.
If you use this code or find our work valuable, please cite:
@misc{zhang2025spargeattn,
      title={SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference}, 
      author={Jintao Zhang and Chendong Xiang and Haofeng Huang and Jia Wei and Haocheng Xi and Jun Zhu and Jianfei Chen},
      year={2025},
      eprint={2502.18137},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.18137}, 
}
@inproceedings{zhang2025sageattention,
      title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration}, 
      author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
      booktitle={International Conference on Learning Representations (ICLR)},
      year={2025}
}
@misc{zhang2024sageattention2,
      title={SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization}, 
      author={Jintao Zhang and Haofeng Huang and Pengle Zhang and Jia Wei and Jun Zhu and Jianfei Chen},
      year={2024},
      eprint={2411.10958},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2411.10958}, 
}