👏 Join our WeChat and Discord | 💻 Official website — try our model!
- September 28, 2025: 📖 HunyuanImage-3.0 Technical Report Released - Comprehensive technical documentation now available
- September 28, 2025: 🚀 HunyuanImage-3.0 Open Source Release - Inference code and model weights publicly available
If you develop or use HunyuanImage-3.0 in your projects, please let us know.
- HunyuanImage-3.0 (Image Generation Model)
- Inference
- HunyuanImage-3.0 Checkpoints
- HunyuanImage-3.0-Instruct Checkpoints (with reasoning)
- VLLM Support
- Distilled Checkpoints
- Image-to-Image Generation
- Multi-turn Interaction
- 🔥🔥🔥 News
- 🧩 Community Contributions
- 📑 Open-source Plan
- 📖 Introduction
- ✨ Key Features
- 🛠️ Dependencies and Installation
- 🚀 Usage
- 🧱 Models Cards
- 📝 Prompt Guide
- 📊 Evaluation
- 📚 Citation
- 🙏 Acknowledgements
- 🌟🚀 Github Star History
HunyuanImage-3.0 is a groundbreaking native multimodal model that unifies multimodal understanding and generation within an autoregressive framework. Our text-to-image module achieves performance comparable to or surpassing leading closed-source models.
- 🧠 Unified Multimodal Architecture: Moving beyond the prevalent DiT-based architectures, HunyuanImage-3.0 employs a unified autoregressive framework. This design enables a more direct and integrated modeling of text and image modalities, leading to surprisingly effective and contextually rich image generation.
- 🏆 The Largest Image Generation MoE Model: This is the largest open-source image generation Mixture of Experts (MoE) model to date. It features 64 experts and a total of 80 billion parameters, with 13 billion activated per token, significantly enhancing its capacity and performance.
- 🎨 Superior Image Generation Performance: Through rigorous dataset curation and advanced reinforcement learning post-training, we've achieved an optimal balance between semantic accuracy and visual excellence. The model demonstrates exceptional prompt adherence while delivering photorealistic imagery with stunning aesthetic quality and fine-grained details.
- 💭 Intelligent World-Knowledge Reasoning: The unified multimodal architecture endows HunyuanImage-3.0 with powerful reasoning capabilities. It leverages its extensive world knowledge to intelligently interpret user intent, automatically elaborating on sparse prompts with contextually appropriate details to produce superior, more complete visual outputs.
- 🖥️ Operating System: Linux
- 🎮 GPU: NVIDIA GPU with CUDA support
- 💾 Disk Space: 170GB for model weights
- 🧠 GPU Memory: ≥3×80GB (4×80GB recommended for better performance)
- 🐍 Python: 3.12+ (recommended and tested)
- 🔥 PyTorch: 2.7.1
- ⚡ CUDA: 12.8
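If you want to confirm your machine meets these requirements before downloading the weights, a quick check like the following can help. This is a convenience sketch, not part of the repository:

```python
# Environment check against the requirements listed above (Python >= 3.12,
# PyTorch 2.7.1 with CUDA 12.8, >= 3 GPUs with 80 GB each).
import sys
import torch

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)

n_gpus = torch.cuda.device_count()
print("GPUs:", n_gpus)
for i in range(n_gpus):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")
```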
# 1. First install PyTorch (CUDA 12.8 Version)
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
# 2. Then install tencentcloud-sdk
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python
# 3. Then install other dependencies
pip install -r requirements.txt
For up to 3x faster inference, install these optimizations:
# FlashAttention for faster attention computation
pip install flash-attn==2.8.3 --no-build-isolation
# FlashInfer for optimized MoE inference. v0.3.1 is tested.
pip install flashinfer-python
💡 Installation Tips: The CUDA version used by PyTorch must match the system's CUDA version, because FlashInfer relies on this compatibility when compiling kernels at runtime. PyTorch 2.7.1+cu128 is tested. GCC >= 9 is recommended for compiling FlashAttention and FlashInfer.
⚡ Performance Tips: These optimizations can significantly speed up your inference!
💡 Note: When FlashInfer is enabled, the first inference may be slower (about 10 minutes) due to kernel compilation. Subsequent inferences on the same machine will be much faster.
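Before launching a long inference job, you can optionally verify that the speed-up packages import correctly. This is a convenience sketch, not part of the repository; the module names flash_attn and flashinfer correspond to the packages installed above:

```python
# Sanity check for the optional speed-ups; falls back gracefully if they are absent.
def check_optional_deps():
    import torch
    print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
    for name in ("flash_attn", "flashinfer"):
        try:
            mod = __import__(name)
            print(f"{name}: {getattr(mod, '__version__', 'version unknown')}")
        except ImportError:
            print(f"{name}: not installed -- will fall back to sdpa/eager")

if __name__ == "__main__":
    check_optional_deps()
```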
# Download from HuggingFace and rename the directory.
# Notice that the directory name should not contain dots, which may cause issues when loading using Transformers.
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
from transformers import AutoModelForCausalLM
# Load the model
model_id = "./HunyuanImage-3"
# Currently we can not load the model using HF model_id `tencent/HunyuanImage-3.0` directly
# due to the dot in the name.
kwargs = dict(
attn_implementation="sdpa", # Use "flash_attention_2" if FlashAttention is installed
trust_remote_code=True,
torch_dtype="auto",
device_map="auto",
moe_impl="eager", # Use "flashinfer" if FlashInfer is installed
)
model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)
# generate the image
prompt = "A brown and white dog is running on the grass"
image = model.generate_image(prompt=prompt, stream=True)
image.save("image.png")
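After loading with device_map="auto", you can optionally inspect how the model was sharded across your GPUs. The sketch below assumes the model loaded above is available as `model`; hf_device_map is the attribute Accelerate populates when a device map is used:

```python
# Inspect layer placement and per-GPU memory after loading with device_map="auto".
import torch

device_map = getattr(model, "hf_device_map", None)
if device_map is not None:
    for module_name, device in sorted(device_map.items(), key=lambda kv: str(kv[1])):
        print(f"{device}: {module_name}")

for i in range(torch.cuda.device_count()):
    used = torch.cuda.memory_allocated(i) / 1024**3
    print(f"GPU {i}: {used:.1f} GiB allocated")
```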
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.git
cd HunyuanImage-3.0/
# Download from HuggingFace
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
The Pretrain Checkpoint does not automatically rewrite or enhance input prompts. For optimal results, we currently recommend using DeepSeek to rewrite prompts. You can apply for an API key on Tencent Cloud.
# set env
export DEEPSEEK_KEY_ID="your_deepseek_key_id"
export DEEPSEEK_KEY_SECRET="your_deepseek_key_secret"
python3 run_image_gen.py --model-id ./HunyuanImage-3 --verbose 1 --sys-deepseek-prompt "universal" --prompt "A brown and white dog is running on the grass"
Argument | Description | Default |
---|---|---|
`--prompt` | Input prompt | (Required) |
`--model-id` | Model path | (Required) |
`--attn-impl` | Attention implementation: `sdpa` or `flash_attention_2` | `sdpa` |
`--moe-impl` | MoE implementation: `eager` or `flashinfer` | `eager` |
`--seed` | Random seed for image generation | `None` |
`--diff-infer-steps` | Diffusion inference steps | `50` |
`--image-size` | Image resolution: `auto`, an explicit size like `1280x768`, or an aspect ratio like `16:9` | `auto` |
`--save` | Image save path | `image.png` |
`--verbose` | Verbosity level: 0 = no log, 1 = log inference information | `0` |
`--rewrite` | Whether to enable prompt rewriting | `1` |
`--sys-deepseek-prompt` | System prompt to use: `universal` or `text_rendering` | `universal` |
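If you want to generate several images in one run, a small wrapper around the CLI shown above can loop over prompts. This sketch only uses the flags from the table; the prompts, seeds, and output names are illustrative, and it assumes the DeepSeek environment variables above are already set:

```python
# Batch the CLI entry point over several prompts with per-prompt seeds and outputs.
import subprocess

prompts = [
    "A brown and white dog is running on the grass",
    "A watercolor painting of a lighthouse at dawn",
]

for i, prompt in enumerate(prompts):
    subprocess.run(
        [
            "python3", "run_image_gen.py",
            "--model-id", "./HunyuanImage-3",
            "--prompt", prompt,
            "--seed", str(42 + i),                 # vary the seed per prompt
            "--sys-deepseek-prompt", "universal",
            "--save", f"image_{i}.png",
        ],
        check=True,
    )
```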
Launch an interactive web interface for easy text-to-image generation.
pip install gradio>=4.21.0
# Set your model path
export MODEL_ID="path/to/your/model"
# Optional: Configure GPU usage (default: 0,1,2,3)
export GPUS="0,1,2,3"
# Optional: Configure host and port (default: 0.0.0.0:443)
export HOST="0.0.0.0"
export PORT="443"
Basic Launch:
sh run_app.sh
With Performance Optimizations:
# Use both optimizations for maximum performance
sh run_app.sh --moe-impl flashinfer --attn-impl flash_attention_2
🌐 Web Interface: Open your browser and navigate to http://localhost:443 (or your configured port).
Model | Params | Download | Recommended VRAM | Supported |
---|---|---|---|---|
HunyuanImage-3.0 | 80B total (13B active) | HuggingFace | ≥ 3 × 80 GB | ✅ Text-to-Image |
HunyuanImage-3.0-Instruct | 80B total (13B active) | HuggingFace | ≥ 3 × 80 GB | ✅ Text-to-Image ✅ Prompt Self-Rewrite ✅ CoT Think |
Notes:
- Install performance extras (FlashAttention, FlashInfer) for faster inference.
- Multi‑GPU inference is recommended for the Base model.
The Pretrain Checkpoint does not automatically rewrite or enhance input prompts, while the Instruct Checkpoint can rewrite and enhance prompts with thinking. For optimal results, we currently recommend consulting our official guide on how to write effective prompts.
Reference: HunyuanImage 3.0 Prompt Handbook
We've included two system prompts in the PE folder of this repository that leverage DeepSeek to automatically enhance user inputs:
- system_prompt_universal: Converts photographic-style and artistic prompts into detailed ones.
- system_prompt_text_rendering: Converts UI/poster/text-rendering prompts into detailed ones suited to the model.
Note that these system prompts are in Chinese because DeepSeek works better with Chinese system prompts. If you want to use them with an English-oriented model, you can translate them into English or refer to the comments in the PE file as a guide.
We have also created a Yuanqi workflow that implements the universal one, which you can try directly.
- Content Priority: Focus on describing the main subject and action first, followed by details about the environment and style. A general description framework is: main subject and scene + image quality and style + composition and perspective + lighting and atmosphere + technical parameters. Keywords can be added both before and after this structure (see the sketch after this list).
- Image Resolution: Our model supports multiple resolutions and offers both automatic and specified resolution options. In auto mode, the model predicts the image resolution from the input prompt. In specified mode (like traditional DiT), the model outputs an image whose resolution strictly matches the user's chosen resolution.
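As an illustration of the content-priority framework, the sketch below builds a prompt in that order and feeds it to the same generate_image API shown earlier; the scene itself is made up:

```python
# A prompt organized as: main subject and scene + quality/style + composition +
# lighting/atmosphere + technical parameters.
prompt = (
    "A golden retriever leaping to catch a frisbee in a sunlit park, "  # main subject and scene
    "photorealistic, rich detail, "                                     # image quality and style
    "low-angle shot with shallow depth of field, "                      # composition and perspective
    "warm late-afternoon backlight, "                                   # lighting and atmosphere
    "85mm lens, f/1.8"                                                  # technical parameters
)
image = model.generate_image(prompt=prompt, stream=True)
image.save("park_dog.png")
```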
Our model can follow complex instructions to generate high‑quality, creative images.
Our model can effectively process very long text inputs, enabling users to precisely control the finer details of generated images. Extended prompts allow for intricate elements to be accurately captured, making it ideal for complex projects requiring precision and creativity.
- 🤖 SSAE (Machine Evaluation)
SSAE (Structured Semantic Alignment Evaluation) is an intelligent image-text alignment metric based on advanced multimodal large language models (MLLMs). We extracted 3,500 key points across 12 categories, then used MLLMs to automatically score each generated image against these key points based on its visual content. Mean Image Accuracy is the image-wise average score over each image's key points, while Global Accuracy averages directly over all key points; a small numeric sketch follows this section.
- 👥 GSB (Human Evaluation)
We adopted the GSB (Good/Same/Bad) evaluation method commonly used to assess the relative performance between two models from an overall image perception perspective. In total, we utilized 1,000 text prompts, generating an equal number of image samples for all compared models in a single run. For a fair comparison, we conducted inference only once for each prompt, avoiding any cherry-picking of results. When comparing with the baseline methods, we maintained the default settings for all selected models. The evaluation was performed by more than 100 professional evaluators.
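As a small numeric sketch of the two SSAE aggregation schemes described above, the example below assumes per-image lists of binary key-point scores (1 means the key point is satisfied); the data is made up purely to show how the two averages differ:

```python
# Toy per-image key-point scores (1 = satisfied, 0 = not satisfied).
scores_per_image = [
    [1, 1, 0],        # image 1: 3 key points
    [1, 0, 0, 1, 1],  # image 2: 5 key points
]

# Mean Image Accuracy: average each image first, then average across images.
mean_image_acc = sum(sum(s) / len(s) for s in scores_per_image) / len(scores_per_image)

# Global Accuracy: pool all key points and average once.
all_points = [p for s in scores_per_image for p in s]
global_acc = sum(all_points) / len(all_points)

print(f"Mean Image Accuracy: {mean_image_acc:.3f}")  # (2/3 + 3/5) / 2 ≈ 0.633
print(f"Global Accuracy:     {global_acc:.3f}")      # 5/8 = 0.625
```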
If you find HunyuanImage-3.0 useful in your research, please cite our work:
@misc{HunyuanImage-3.0,
title={HunyuanImage 3.0: Technical Report},
author={Tencent Hunyuan Team},
year={2025},
howpublished={\url{https://github.com/Tencent-Hunyuan/HunyuanImage-3.0}},
}
We extend our heartfelt gratitude to the following open-source projects and communities for their invaluable contributions:
- 🤗 Transformers - State-of-the-art NLP library
- 🎨 Diffusers - Diffusion models library
- 🌐 HuggingFace - AI model hub and community
- ⚡ FlashAttention - Memory-efficient attention
- 🚀 FlashInfer - Optimized inference engine