We introduce Genjo, an 8B-parameter Large Language Diffusion model. Genjo serves as the core foundation for our diffusion-based language models and will integrate with our Enso diffusion Mixture of Experts (MoE) architecture.
Genjo represents a novel approach to language modeling, using diffusion rather than traditional autoregressive generation. The model employs a masked diffusion technique that:
- Uses a Transformer Encoder with bidirectional attention instead of a Decoder with causal attention
- Employs a masking ratio that varies randomly between 0 and 1, providing an upper bound on the negative log-likelihood (see the sketch after this list)
- Achieves performance competitive with autoregressive models on numerous benchmarks
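To make the masking-ratio point concrete, here is a minimal sketch of this style of training objective. It assumes a PyTorch mask predictor that returns logits; the helper name, the placeholder mask-token id, and the exact 1/t reweighting are illustrative assumptions, not Genjo's actual training code.

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # placeholder mask-token id; the real id comes from Genjo's tokenizer

def masked_diffusion_loss(mask_predictor, input_ids):
    """One illustrative training step of a masked diffusion objective.

    A masking ratio t ~ U(0, 1) is drawn per sequence, each token is masked
    independently with probability t, and the cross-entropy on the masked
    positions is reweighted by 1/t, giving an upper bound on the negative
    log-likelihood of the clean sequence.
    """
    b, l = input_ids.shape
    t = torch.rand(b, device=input_ids.device).clamp_min(1e-3)   # masking ratio per sequence
    mask = torch.rand(b, l, device=input_ids.device) < t[:, None]
    noisy_ids = input_ids.masked_fill(mask, MASK_ID)

    logits = mask_predictor(noisy_ids).logits                    # (b, l, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), input_ids, reduction="none")
    return (ce * mask / t[:, None]).sum() / (b * l)              # 1/t reweighting on masked positions
```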
Genjo will serve as the foundation for integration with our Enso diffusion Mixture of Experts (MoE) architecture. This integration will enable:
- Scaling to Larger Models: Using MoE architecture to scale to 16B+ parameters while maintaining efficiency
- Expert Specialization: Leveraging specialized experts for different linguistic tasks and patterns
- Efficient Inference: Optimizing sampling through expert routing and batched computation (a generic routing sketch follows this list)
- Improved Performance: Combining the strengths of diffusion models with the parameter efficiency of MoE
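Neither Enso's router nor its expert layout is shown here; the snippet below is only a generic sketch of top-k expert routing, with all layer sizes, class names, and hyperparameters as illustrative assumptions, to make the expert-routing point concrete.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Illustrative top-k routed mixture-of-experts feed-forward layer."""

    def __init__(self, d_model=1024, d_ff=4096, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, seq, d_model)
        scores = self.router(x).softmax(dim=-1)        # routing probabilities per token
        weights, idx = scores.topk(self.top_k, dim=-1) # keep the top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[..., k] == e                 # tokens routed to expert e at slot k
                if sel.any():
                    out[sel] += weights[..., k][sel].unsqueeze(-1) * expert(x[sel])
        return out
```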
The Genjo-8B-Base and Genjo-8B-Instruct models are available on Hugging Face. Please first install transformers==4.38.2 and use the transformers library to load them.
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('GSAI-ML/Genjo-8B-Base', trust_remote_code=True)
model = AutoModel.from_pretrained('GSAI-ML/Genjo-8B-Base', trust_remote_code=True, torch_dtype=torch.bfloat16)
```

We provide get_log_likelihood() and generate() functions in get_log_likelihood.py and generate.py, respectively, for conditional likelihood evaluation and conditional generation.
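As a rough example of how these helpers might be called (the argument names and the decoding step below are assumptions for illustration; generate.py and get_log_likelihood.py document the actual interfaces):

```python
# Assumes `model` and `tokenizer` were loaded as shown above.
# Argument names here are illustrative; check generate.py for the actual signature.
from generate import generate

model = model.to('cuda').eval()
prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.to('cuda')
output_ids = generate(model, prompt_ids, steps=128, gen_length=128, temperature=0.0)
print(tokenizer.batch_decode(output_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0])
```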
You can directly run `python chat.py` to have multi-round conversations with Genjo-8B-Instruct.
In addition, please refer to our paper and GUIDELINES.md for more details about the inference methods.
Thank you very much to apolinário for helping us create this amazing demo!
First, install Gradio with `pip install gradio`; then you can run the demo directly with `python app.py`.
Genjo uses a Transformer Encoder as the backbone for its mask predictor. The key differences from autoregressive models (like LLaMA) and other masked language models (like BERT) are:
- Bidirectional Attention: Removes causal masking in self-attention, allowing the model to attend to all tokens in the sequence (illustrated in the sketch after this list).
- Dynamic Masking Ratio: Employs a masking ratio that varies randomly between 0 and 1 during training.
- Diffusion Process: Uses an iterative denoising process for text generation rather than autoregressive token prediction.
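The first difference can be seen directly in how attention is computed. This is a minimal sketch using PyTorch's scaled_dot_product_attention with illustrative shapes; it is not Genjo's attention implementation.

```python
import torch
import torch.nn.functional as F

b, h, l, d = 1, 8, 16, 64                       # batch, heads, sequence length, head dim
q = torch.randn(b, h, l, d)
k = torch.randn(b, h, l, d)
v = torch.randn(b, h, l, d)

# Autoregressive (LLaMA-style): each position attends only to earlier positions.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Encoder-style mask predictor: no causal mask, every position attends to all tokens.
bidirectional_out = F.scaled_dot_product_attention(q, k, v)
```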
The model performs generation through an iterative masked diffusion process:
- Start with a prompt and fully masked response
- Predict all masked tokens simultaneously
- Selectively unmask tokens based on confidence or random selection (a confidence-based variant is sketched after this list)
- Repeat until all tokens are unmasked
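A simplified sketch of this sampling loop is shown below, assuming a mask predictor that returns logits. The placeholder mask-token id, the step schedule, and the confidence-based unmasking rule are illustrative assumptions rather than Genjo's exact sampler.

```python
import torch

MASK_ID = 126336  # placeholder mask-token id; the real id comes from Genjo's tokenizer

@torch.no_grad()
def diffusion_sample(mask_predictor, prompt_ids, gen_length=128, steps=64):
    """Iteratively unmask a fully masked response conditioned on the prompt (illustrative)."""
    device = prompt_ids.device
    masked_tail = torch.full((1, gen_length), MASK_ID, dtype=prompt_ids.dtype, device=device)
    x = torch.cat([prompt_ids, masked_tail], dim=1)
    per_step = max(1, gen_length // steps)

    while (x == MASK_ID).any():
        probs = mask_predictor(x).logits.softmax(dim=-1)   # predict every position at once
        conf, pred = probs.max(dim=-1)                      # per-position confidence and argmax
        conf = conf.masked_fill(x != MASK_ID, -1.0)         # only consider still-masked positions
        k = min(per_step, int((x == MASK_ID).sum()))
        top = conf.topk(k, dim=-1).indices                  # most confident masked positions
        x[0, top[0]] = pred[0, top[0]]                      # unmask them
    return x
```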
We use two evaluation methods:
- Conditional likelihood estimation for specific metrics
- Conditional generation for other benchmarks
The model achieves competitive performance on a wide range of benchmarks including:
- BBH (Big-Bench Hard)
- GSM8K (Grade School Math 8K)
- MATH
- HumanEval
- MBPP (Mostly Basic Python Programming)
Our roadmap for Genjo + Enso integration includes:
- Sampling Efficiency Optimizations:
  - Implementing semi-autoregressive sampling approaches (see the sketch after this roadmap)
  - Applying consistency distillation to reduce sampling steps
  - Leveraging MoE architecture for more efficient inference
- Model Scaling:
  - Integrating Enso's MoE architecture to scale to 16B+ parameters
  - Implementing expert specialization for different linguistic tasks
  - Optimizing expert routing for language processing
- Training Improvements:
  - Applying rectified flow training for better performance and faster convergence
  - Implementing DeepSpeed for efficient training of large-scale models
  - Exploring hybrid MoE architectures combining dense and sparse layers
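As an illustration of the semi-autoregressive direction mentioned in the roadmap, a block-wise variant can denoise the response one fixed-length block at a time, conditioning each block on everything generated so far. This sketch reuses the illustrative diffusion_sample helper from the generation section above; block sizes and names are assumptions.

```python
def semi_autoregressive_sample(mask_predictor, prompt_ids,
                               gen_length=128, block_length=32, steps_per_block=16):
    """Block-wise generation: each block is denoised by the diffusion loop while
    conditioning on the prompt plus all previously generated blocks."""
    x = prompt_ids
    for _ in range(gen_length // block_length):
        x = diffusion_sample(mask_predictor, x, gen_length=block_length, steps=steps_per_block)
    return x
```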
If you find our work useful, please cite:
```bibtex
@article{nie2025large,
  title={Large Language Diffusion Models},
  author={Nie, Shen and Zhu, Fengqi and You, Zebin and Zhang, Xiaolu and Ou, Jingyang and Hu, Jun and Zhou, Jun and Lin, Yankai and Wen, Ji-Rong and Li, Chongxuan},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}
```

For the MoE architecture integration:
```bibtex
@article{fei2024scaling,
  title={Scaling Diffusion Transformers to 16 Billion Parameters},
  author={Fei, Zhengcong and Fan, Mingyuan and Yu, Changqian and Li, Debang and Huang, Junshi},
  journal={arXiv preprint},
  year={2024}
}
```