# SparseD: Sparse Attention for Diffusion Language Models 🥯 [[arXiv](https://arxiv.org/abs/2509.24014)]
Zeqing Wang¹, Gongfan Fang¹, Xinyin Ma¹, Xingyi Yang², Xinchao Wang¹

¹ xML Lab, National University of Singapore
² The Hong Kong Polytechnic University
SparseD is a novel sparse attention method for diffusion language models (DLMs) that delivers near-lossless acceleration. It runs full attention during the early denoising steps to compute sparse attention patterns, then reuses those patterns in later steps to restrict computation and improve efficiency. Extensive experiments show that SparseD largely maintains accuracy on the evaluated benchmarks while achieving substantial speedups, especially on long-context prompts.
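For intuition, the sketch below shows one way a block-level sparse pattern could be computed from an early full-attention step and then reused. It is a simplified illustration under our own assumptions: the helper name `compute_block_pattern`, the block-sum aggregation, and the top-k selection rule are illustrative and not taken from the repository code.

```python
# Illustrative sketch only: derive a block-level sparse pattern from one
# full-attention step, then reuse it to restrict later denoising steps.
import torch

def compute_block_pattern(attn_probs: torch.Tensor, block_size: int = 32, select: float = 0.3):
    """attn_probs: (L, L) attention probabilities from an early full-attention step.
    Returns a (num_blocks, num_blocks) boolean mask; True marks the key blocks kept
    for each query block (the top `select` fraction by attention mass)."""
    L = attn_probs.shape[-1]
    nb = L // block_size
    trimmed = attn_probs[: nb * block_size, : nb * block_size]
    # Sum attention mass inside every (query-block, key-block) pair -> (nb, nb).
    block_scores = trimmed.reshape(nb, block_size, nb, block_size).sum(dim=(1, 3))
    k = max(1, int(select * nb))
    keep = torch.topk(block_scores, k, dim=-1).indices          # (nb, k)
    mask = torch.zeros(nb, nb, dtype=torch.bool)
    mask[torch.arange(nb).unsqueeze(1).expand_as(keep), keep] = True
    return mask

# Early denoising steps: run full attention and record the pattern once.
# Later denoising steps: attend only within the kept blocks (e.g., via a
# FlexAttention block mask) instead of recomputing full attention.
```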
Installation:

```bash
conda create -n SparseD python=3.10
conda activate SparseD
pip install -r requirements.txt
```
Inference:

```bash
# For Dream Model
python dream_generation.py --origin
python dream_generation.py --skip 0.2 --select 0.3 --block_size 128 --prompt 4k
python dream_generation.py --skip 0.2 --select 0.5 --block_size 32 --prompt short_context

# For LLaDA Model
python llada_generation.py --origin
python llada_generation.py --skip 0.2 --select 0.3 --block_size 128 --prompt 4k
python llada_generation.py --skip 0.2 --select 0.5 --block_size 32 --prompt short_context
```
Arguments:

- `--model_path`: The model path, e.g., `Dream-org/Dream-v0-Instruct-7B` or `GSAI-ML/LLaDA-1.5`.
- `--seq_len`, `--steps`, `--block_length`, `--sampling-alg`: The inference configuration for diffusion generation. `block_length` applies only to the LLaDA model.
- `--origin`: Run inference with the original (full-attention) model.
- `--skip`, `--select`, `--block_size`: The inference configuration for SparseD. `skip` is the fraction of early denoising steps (out of all denoising steps) that use full attention, `select` is the selection ratio for sparse attention, and `block_size` is the block size used when selecting important query-key pairs. A walk-through of how these values interact is sketched after this list.
- `--prompt`: Choose a prompt for a quick test from the `["short_context", "4k", "8k", "16k", "32k", "64k"]` length versions.
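As a rough, hypothetical walk-through of how these knobs interact (the numbers and rounding are illustrative only and may not match the repository's exact bookkeeping):

```python
# Hypothetical example of the SparseD knobs; values mirror the commands above.
steps = 256        # --steps: total denoising steps (example value)
skip = 0.2         # --skip: fraction of early steps that run full attention
select = 0.3       # --select: fraction of key blocks kept per query block
block_size = 128   # --block_size: block granularity for pair selection
seq_len = 4096     # e.g., the "4k" prompt

full_steps = int(skip * steps)                  # 51 early steps use full attention
sparse_steps = steps - full_steps               # 205 later steps reuse the sparse pattern
num_blocks = seq_len // block_size              # 32 key blocks per query-block row
kept_blocks = max(1, int(select * num_blocks))  # ~9 key blocks attended per query block
print(full_steps, sparse_steps, num_blocks, kept_blocks)  # 51 205 32 9
```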
Notes:

- Since our sparse attention is implemented with FlexAttention, we recommend running a warm-up inference first; subsequent inferences will be noticeably faster. A minimal timing sketch follows this list.
- To best demonstrate the acceleration achieved by SparseD, we recommend evaluating it with long-context prompts, such as the 16k, 32k, and 64k versions. A short-context prompt is also provided for quick evaluation.
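For reference, here is a small, self-contained FlexAttention timing sketch (not part of this repository) showing why the first call is slow: `torch.compile` builds the kernels on the first invocation, and later calls reuse them. It assumes PyTorch ≥ 2.5 with a CUDA GPU and uses a toy local-window mask rather than SparseD's actual pattern.

```python
# Standalone warm-up demo with FlexAttention (illustrative; not SparseD code).
import time
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def local_window(b, h, q_idx, kv_idx):
    # Toy sparsity pattern: each query attends only within a 128-token window.
    return (q_idx - kv_idx).abs() < 128

q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
block_mask = create_block_mask(local_window, B=None, H=None, Q_LEN=1024, KV_LEN=1024, device="cuda")
attn = torch.compile(flex_attention)

for name in ("warm-up (includes kernel compilation)", "steady-state"):
    torch.cuda.synchronize()
    start = time.perf_counter()
    attn(q, k, v, block_mask=block_mask)
    torch.cuda.synchronize()
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```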
Our sparse attention is accelerated with FlexAttention and implemented on Dream and LLaDA. We extend our gratitude to the community for their valuable contributions!
Citation:

```bibtex
@misc{wang2025sparsedsparseattentiondiffusion,
      title={SparseD: Sparse Attention for Diffusion Language Models},
      author={Zeqing Wang and Gongfan Fang and Xinyin Ma and Xingyi Yang and Xinchao Wang},
      year={2025},
      eprint={2509.24014},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24014},
}
```