dInfer: An Efficient Inference Framework for Diffusion Language Models

License: MIT · HuggingFace: Models · Technical Report: arXiv

dInfer is an efficient and extensible inference framework for diffusion large language models (dLLMs). It modularizes inference into four components: the model, the diffusion iteration manager, the decoding strategy, and KV-cache management, and provides well-designed APIs for flexibly combining algorithms within each component.

Figure: Overall architecture of dInfer (v0.1)

dInfer supports multiple dLLM variants, including LLaDA and LLaDA-MoE, and introduces new algorithms in each component to improve decoding quality and inference speed: a soft diffusion iteration algorithm for smoother denoising, hierarchical and credit decoding for enhanced parallel decoding, and a vicinity refresh strategy for KV-cache management that mitigates cache staleness. Beyond these algorithmic improvements, dInfer integrates several system-level optimizations: it supports both tensor parallelism (TP) and expert parallelism (EP) to maximize GPU utilization even at batch size 1, leverages PyTorch compilation and NVIDIA CUDA Graphs for efficient kernel execution, and introduces a loop-unrolling mechanism that eliminates CUDA stream bubbles across diffusion iterations. A minimal composition sketch follows.
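
To make the modular design concrete, the sketch below wires the four components into a single pipeline. It is illustrative only: the class names and arguments are copied from the quick-start example later in this README, and model loading (covered in Step 3 below) is omitted.

# Illustrative composition of the four components; see Step 3 below for full model loading.
from dinfer import BlockIteratorFactory, KVCacheFactory
from dinfer import ThresholdParallelDecoder, BlockWiseDiffusionLLM

decoder = ThresholdParallelDecoder(0, threshold=0.9, mask_id=156895, eos_id=156892)  # decoding strategy
iterator_factory = BlockIteratorFactory(True)   # diffusion iteration manager
cache_factory = KVCacheFactory('dual')          # KV-cache management policy
# With a loaded model (see Step 3), the components combine into one diffusion LLM:
# dllm = BlockWiseDiffusionLLM(model, decoder, iterator_factory, cache_factory=cache_factory)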

Benchmark results

Figure: Benchmark results of dInfer v0.1

On HumanEval, dInfer achieves over 1,100 TPS at batch size 1, and averages more than 800 TPS across six benchmarks on a single node with 8× H800 GPUs. Compared with Fast-dLLM, dInfer delivers more than a 10× speedup while maintaining accuracy; when serving LLaDA-MoE, it provides a 2-3× speedup over Qwen2.5-3B running on vLLM with comparable quality.

Get started

Please follow the instructions below to install dInfer.

git clone https://github.com/inclusionAI/dInfer.git
cd dInfer
pip install .
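
A quick import check can confirm the installation; this assumes the package is importable as dinfer, the module name used in the example below.

# Minimal post-install check (assumes the dinfer module name used later in this README).
import dinfer
print(dinfer.__file__)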

Run dInfer with LLaDA-MoE downloaded from HuggingFace

This project supports LLaDA(-MoE) checkpoints from HuggingFace. After downloading a model, run the CPU conversion script to fuse the MoE experts into the FusedMoE format, which dInfer can then load locally.

Step 1: Download checkpoints

pip install -U huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1

# Example: Instruct checkpoint
hf download inclusionAI/LLaDA-MoE-7B-A1B-Instruct \
  --repo-type model \
  --local-dir /path/to/LLaDA-MoE-7B-A1B-Instruct
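
If you prefer to script the download, the huggingface_hub Python API offers an equivalent path; the repo id matches the CLI example above, and the local directory is a placeholder.

# Equivalent download via the huggingface_hub Python API (local path is a placeholder).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/LLaDA-MoE-7B-A1B-Instruct",
    repo_type="model",
    local_dir="/path/to/LLaDA-MoE-7B-A1B-Instruct",
)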

Step 2: Convert to FusedMoE format

The downloaded weights must be converted so that the MoE experts are fused into the FusedMoE format. Use the conversion tool to fuse the experts in each MoE layer.

# From repo root
python tools/transfer.py \
  --input  /path/to/LLaDA-MoE-7B-A1B-Instruct \
  --output /path/to/LLaDA-MoE-7B-A1B-Instruct-fused
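
As a quick sanity check before loading, you can list the converted output directory; the exact files produced by the converter are not documented here, so this is only an illustrative check.

# Illustrative check that the fused output directory was written (file layout may vary).
from pathlib import Path

fused_dir = Path("/path/to/LLaDA-MoE-7B-A1B-Instruct-fused")
print(sorted(p.name for p in fused_dir.iterdir()))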

Step 3: Use the model in dInfer

import os
import torch
from transformers import AutoTokenizer, AutoConfig
from vllm import distributed
from vllm.config import ParallelConfig, VllmConfig, set_current_vllm_config

from dinfer.model import FusedOlmoeForCausalLM
from dinfer import BlockIteratorFactory, KVCacheFactory
from dinfer import ThresholdParallelDecoder, BlockWiseDiffusionLLM

model_path = "/path/to/LLaDA-MoE-7B-A1B-Instruct-fused"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Initialize a single-GPU distributed environment, required by vLLM's parallel layers.
device = torch.device(0)
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12346'
distributed.init_distributed_environment(1, 0, 'env://', 0, 'nccl')
distributed.initialize_model_parallel(1, backend='nccl')

# Load the fused model under a vLLM config with expert parallelism enabled.
parallel_config = ParallelConfig(enable_expert_parallel=True)
with set_current_vllm_config(VllmConfig(parallel_config=parallel_config)):
    model_config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    model = FusedOlmoeForCausalLM(config=model_config).eval()
    model.load_weights(model_path, torch_dtype=torch.bfloat16)
    model = model.to(device)

# Assemble the diffusion LLM: parallel decoder + block iterator + dual KV-cache.
decoder = ThresholdParallelDecoder(0, threshold=0.9, mask_id=156895, eos_id=156892)
dllm = BlockWiseDiffusionLLM(model, decoder, BlockIteratorFactory(True), cache_factory=KVCacheFactory('dual'))

# Build the prompt with the chat template and generate.
prompt = "Lily can run 12 kilometers per hour for 4 hours. After that, she can run 6 kilometers per hour. How many kilometers can she run in 8 hours?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_ids = tokenizer(prompt)['input_ids']
input_ids = torch.tensor(input_ids).to(device).unsqueeze(0)
res = dllm.generate(input_ids, gen_length=1024, block_length=64)
print(tokenizer.decode(res[0, input_ids.shape[1]:], skip_special_tokens=False))
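
To relate this example to the throughput numbers above, a rough timing wrapper around the same generate() call is sketched below. It simply divides the number of generated positions by wall-clock time, so padding and EOS tokens are included and the result is only a coarse TPS estimate, not a benchmark-grade measurement.

# Rough TPS estimate around the same generate() call (continues from the example above).
import time

torch.cuda.synchronize()
start = time.perf_counter()
res = dllm.generate(input_ids, gen_length=1024, block_length=64)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

gen_tokens = res.shape[1] - input_ids.shape[1]
print(f"{gen_tokens} generated positions in {elapsed:.2f}s -> {gen_tokens / elapsed:.1f} TPS")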

Cite

@article{dinfer,
    title={dInfer: An Efficient Inference Framework for Diffusion Language Models},
    author={Yuxin Ma and Lun Du and Lanning Wei and Kun Chen and Qian Xu and Kangyu Wang and Guofeng Feng and Guoshan Lu and Lin Liu and Xiaojing Qi and Xinyuan Zhang and Zhen Tao and Haibo Feng and Ziyun Jiang and Ying Xu and Zenan Huang and Yihong Zhuang and Haokai Xu and Jiaqi Hu and Zhenzhong Lan and Junbo Zhao and Jianguo Li and Da Zheng},
    year={2025},
    journal={arXiv preprint arXiv:2510.08666}
}
