# SparseD: Sparse Attention for Diffusion Language Models 🥯 [[arXiv](https://arxiv.org/abs/2509.24014)]
Zeqing Wang¹, Gongfan Fang¹, Xinyin Ma¹, Xingyi Yang², Xinchao Wang¹

¹ xML Lab, National University of Singapore
² The Hong Kong Polytechnic University
SparseD is a novel sparse attention method for diffusion language models (DLMs) that delivers near-lossless acceleration. It runs full attention during the early denoising steps to compute sparse attention patterns, then reuses those patterns in later steps to restrict computation and improve efficiency. Extensive experiments show that SparseD largely maintains accuracy on the evaluated benchmarks while achieving substantial speedups, especially on long-context prompts.
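For intuition, the sketch below shows one way a block-level sparse pattern could be computed from an early full-attention step and then reused. It is a simplified illustration under our own assumptions: the helper name `compute_block_pattern`, the block-sum aggregation, and the top-k selection rule are illustrative and not taken from the repository code.

```python
# Illustrative sketch only: derive a block-level sparse pattern from one
# full-attention step, then reuse it to restrict later denoising steps.
import torch

def compute_block_pattern(attn_probs: torch.Tensor, block_size: int = 32, select: float = 0.3):
    """attn_probs: (L, L) attention probabilities from an early full-attention step.
    Returns a (num_blocks, num_blocks) boolean mask; True marks the key blocks kept
    for each query block (the top `select` fraction by attention mass)."""
    L = attn_probs.shape[-1]
    nb = L // block_size
    trimmed = attn_probs[: nb * block_size, : nb * block_size]
    # Sum attention mass inside every (query-block, key-block) pair -> (nb, nb).
    block_scores = trimmed.reshape(nb, block_size, nb, block_size).sum(dim=(1, 3))
    k = max(1, int(select * nb))
    keep = torch.topk(block_scores, k, dim=-1).indices          # (nb, k)
    mask = torch.zeros(nb, nb, dtype=torch.bool)
    mask[torch.arange(nb).unsqueeze(1).expand_as(keep), keep] = True
    return mask

# Early denoising steps: run full attention and record the pattern once.
# Later denoising steps: attend only within the kept blocks (e.g., via a
# FlexAttention block mask) instead of recomputing full attention.
```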
Installation:

```bash
conda create -n SparseD python=3.10
conda activate SparseD
pip install -r requirements.txt
```
Inference:

```bash
# For Dream Model
python dream_generation.py --origin
python dream_generation.py --skip 0.2 --select 0.3 --block_size 128 --prompt 4k
python dream_generation.py --skip 0.2 --select 0.5 --block_size 32 --prompt short_context

# For LLaDA Model
python llada_generation.py --origin
python llada_generation.py --skip 0.2 --select 0.3 --block_size 128 --prompt 4k
python llada_generation.py --skip 0.2 --select 0.5 --block_size 32 --prompt short_context
```
Arguments:

- `--model_path`: The model path, e.g., `Dream-org/Dream-v0-Instruct-7B` or `GSAI-ML/LLaDA-1.5`.
- `--seq_len`, `--steps`, `--block_length`, `--sampling-alg`: The inference configuration for diffusion generation. `block_length` applies only to the LLaDA model.
- `--origin`: Run inference with the original (full-attention) model.
- `--skip`, `--select`, `--block_size`: The inference configuration for SparseD. `skip` is the fraction of early denoising steps (out of all denoising steps) that use full attention, `select` is the selection ratio for sparse attention, and `block_size` is the block size used when selecting important query-key pairs. A walk-through of how these values interact is sketched after this list.
- `--prompt`: Choose a prompt for a quick test from the `["short_context", "4k", "8k", "16k", "32k", "64k"]` length versions.
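As a rough, hypothetical walk-through of how these knobs interact (the numbers and rounding are illustrative only and may not match the repository's exact bookkeeping):

```python
# Hypothetical example of the SparseD knobs; values mirror the commands above.
steps = 256        # --steps: total denoising steps (example value)
skip = 0.2         # --skip: fraction of early steps that run full attention
select = 0.3       # --select: fraction of key blocks kept per query block
block_size = 128   # --block_size: block granularity for pair selection
seq_len = 4096     # e.g., the "4k" prompt

full_steps = int(skip * steps)                  # 51 early steps use full attention
sparse_steps = steps - full_steps               # 205 later steps reuse the sparse pattern
num_blocks = seq_len // block_size              # 32 key blocks per query-block row
kept_blocks = max(1, int(select * num_blocks))  # ~9 key blocks attended per query block
print(full_steps, sparse_steps, num_blocks, kept_blocks)  # 51 205 32 9
```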
Notes:

- Since our sparse attention is implemented with FlexAttention, we recommend running a warm-up inference first; subsequent inferences will be noticeably faster. A minimal timing sketch follows this list.
- To best demonstrate the acceleration achieved by SparseD, we recommend evaluating it with long-context prompts, such as the 16k, 32k, and 64k versions. A short-context prompt is also provided for quick evaluation.
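For reference, here is a small, self-contained FlexAttention timing sketch (not part of this repository) showing why the first call is slow: `torch.compile` builds the kernels on the first invocation, and later calls reuse them. It assumes PyTorch ≥ 2.5 with a CUDA GPU and uses a toy local-window mask rather than SparseD's actual pattern.

```python
# Standalone warm-up demo with FlexAttention (illustrative; not SparseD code).
import time
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def local_window(b, h, q_idx, kv_idx):
    # Toy sparsity pattern: each query attends only within a 128-token window.
    return (q_idx - kv_idx).abs() < 128

q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
block_mask = create_block_mask(local_window, B=None, H=None, Q_LEN=1024, KV_LEN=1024, device="cuda")
attn = torch.compile(flex_attention)

for name in ("warm-up (includes kernel compilation)", "steady-state"):
    torch.cuda.synchronize()
    start = time.perf_counter()
    attn(q, k, v, block_mask=block_mask)
    torch.cuda.synchronize()
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```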
Our sparse attention is accelerated with FlexAttention and implemented on Dream and LLaDA. We extend our gratitude to the community for their valuable contributions!
Citation:

```bibtex
@misc{wang2025sparsedsparseattentiondiffusion,
      title={SparseD: Sparse Attention for Diffusion Language Models},
      author={Zeqing Wang and Gongfan Fang and Xinyin Ma and Xingyi Yang and Xinchao Wang},
      year={2025},
      eprint={2509.24014},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24014},
}
```