*(demo)*

SparseD: Sparse Attention for Diffusion Language Models 🥯[arXiv](https://arxiv.org/abs/2509.24014)
Zeqing Wang¹, Gongfan Fang¹, Xinyin Ma¹, Xingyi Yang², Xinchao Wang¹
¹ xML Lab, National University of Singapore
² The Hong Kong Polytechnic University

📚 Introduction

SparseD is a novel sparse attention method for diffusion language models (DLMs) that delivers near-lossless acceleration. It runs full attention during the early denoising steps to compute sparse attention patterns, then reuses those patterns in later steps to restrict computation and improve efficiency. Extensive experiments show that SparseD largely maintains accuracy on the evaluated benchmarks while achieving up to a $1.50\times$ speedup at a 64k context length with 1,024 denoising steps.


The overview of SparseD
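
The following is a minimal, illustrative sketch of that schedule, assuming a single attention head with queries `q` and keys `k` of shape `(seq_len, head_dim)` and a sequence length divisible by the block size. The function names and the plain-PyTorch attention are illustrative only; the repository's FlexAttention-based implementation differs.

```python
# Illustrative sketch only; not the repository's FlexAttention-based implementation.
import torch

def block_importance(q, k, block_size):
    """Full-attention pass: total attention mass of each (query-block, key-block) pair."""
    scores = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    n_q, n_k = q.shape[0] // block_size, k.shape[0] // block_size
    return scores.reshape(n_q, block_size, n_k, block_size).sum(dim=(1, 3))

def select_blocks(importance, select):
    """Keep the top `select` fraction of key blocks for every query block."""
    n_keep = max(1, int(select * importance.shape[-1]))
    keep = importance.topk(n_keep, dim=-1).indices
    mask = torch.zeros_like(importance, dtype=torch.bool)
    return mask.scatter_(-1, keep, True)

# Toy tensors standing in for one head's queries and keys.
q, k = torch.randn(4096, 64), torch.randn(4096, 64)

# Early denoising steps (the first `skip` fraction of all steps): full attention,
# during which the block-level sparse pattern is recorded.
block_mask = select_blocks(block_importance(q, k, block_size=128), select=0.3)

# Later steps: reuse `block_mask` so attention is computed only over the selected
# query-key blocks (the sparse attention computation itself is elided here).
```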

🛠️ Setup

```bash
conda create -n SparseD python=3.10
conda activate SparseD
pip install -r requirements.txt
```

🚀 Usage

```bash
# For Dream Model
python dream_generation.py --origin
python dream_generation.py --skip 0.2 --select 0.3 --block_size 128 --prompt 4k
python dream_generation.py --skip 0.2 --select 0.5 --block_size 32 --prompt short_context

# For LLaDA Model
python llada_generation.py --origin
python llada_generation.py --skip 0.2 --select 0.3 --block_size 128 --prompt 4k
python llada_generation.py --skip 0.2 --select 0.5 --block_size 32 --prompt short_context
```

Arguments:

  • --model_path: The model path, e.g., Dream-org/Dream-v0-Instruct-7B or GSAI-ML/LLaDA-1.5.
  • --seq_len, --steps, --block_length, --sampling-alg: The inference configuration for diffusion generation; block_length applies only to the LLaDA model.
  • --origin: Run inference with the original (full-attention) model.
  • --skip, --select, --block_size: The inference configuration for SparseD: skip is the fraction of denoising steps, counted from the beginning, that use full attention; select is the selection ratio for sparse attention; and block_size is the block size used when selecting important query-key pairs. A numeric illustration follows this list.
  • --prompt: Choose a prompt for a quick test; available lengths are ["short_context", "4k", "8k", "16k", "32k", "64k"].
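
As a quick illustration of how these settings interact, here is the arithmetic for a hypothetical run using the settings from the first example command at a 64k context length (the speedup setting mentioned in the introduction):

```python
# Hypothetical configuration: skip/select/block_size from the example command, 64k context.
steps, skip, select, block_size = 1024, 0.2, 0.3, 128
seq_len = 65536  # 64k context

full_attention_steps = int(skip * steps)       # 204 early steps run full attention
sparse_steps = steps - full_attention_steps    # 820 later steps reuse the sparse pattern
key_blocks = seq_len // block_size             # 512 key blocks per query block
kept_blocks = int(select * key_blocks)         # 153 key blocks kept per query block
print(full_attention_steps, sparse_steps, key_blocks, kept_blocks)  # 204 820 512 153
```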

📑 Results

1. Accuracy

*(accuracy comparison figure)*

2. Latency

*(latency comparison figure)*

☀️ Note

  • Since our sparse attention is implemented with FlexAttention, we recommend running a warm-up inference first; subsequent inferences will be noticeably faster (see the timing sketch after this list).

  • To better demonstrate the acceleration achieved by SparseD, we recommend evaluating it with long-context prompts, such as those of 16k, 32k, and 64k tokens. We also provide a short-context prompt for quick evaluation.
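
A minimal timing sketch for the warm-up recommendation above. `generate` is a placeholder for whichever generation call you are benchmarking, not an API provided by this repository; the first call typically triggers kernel compilation for FlexAttention, so only later calls are representative.

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Run fn and return (output, wall-clock seconds), synchronizing the GPU if present."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

# _ = generate(prompt)                  # warm-up: compile kernels, discard its latency
# _, latency = timed(generate, prompt)  # time only the warmed-up run
# print(f"latency: {latency:.2f}s")
```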

🤓 Acknowledgments

Our sparse attention is accelerated by FlexAttention and implemented on top of Dream and LLaDA. We extend our gratitude to the community for their valuable contributions!

🔗 Citation

```bibtex
@misc{wang2025sparsedsparseattentiondiffusion,
      title={SparseD: Sparse Attention for Diffusion Language Models},
      author={Zeqing Wang and Gongfan Fang and Xinyin Ma and Xingyi Yang and Xinchao Wang},
      year={2025},
      eprint={2509.24014},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24014},
}
```
