Demo: SANA-1.5 | SANA-ControlNet | SANA-4bit | SANA-Sprint
ComfyUI: ComfyUI Guidance
Model Zoo: Model Card Collects All Models
Env Preparation: One-Click Env Install
Inference:
    1) diffusers:SanaPipeline
    2) diffusers:SanaPAGPipeline
    3) Ours:SanaPipeline
    4) Inference with Docker
    5) Inference with TXT or JSON Files
Training and Data:
    1) Image-Text Pairs
    2) Multi-Scale Webdataset
    3) TAR File Multi-Scale Webdataset
    4) FSDP Launch
    5) LoRA Training
2K & 4K Resolution Generation: SANA Can Generate 2K & 4K Images (Only 8GB)
ControlNet: Train&Inference Guidance | Model Zoo | Demo
Dreambooth / LoRA Training: Train&Inference Guidance
Quantization: Inference with 8bit | Inference with 4bit (8GB) | 4bit Model | 4bit Demo | 4bit Demo2
8bit Optimizer: How to Config
Inference Scaling: SANA Generate VILA Pick Inference Scaling
Metrics: Metric Toolkit: (FID, CLIP-Score, GenEval, DPG-Bench)
SANA-Sprint: One-Step Diffusion: Arxiv | Train&Inference Guidance | Model Zoo | HF Weights
SANA-1.5: Efficient Model Scaling: Arxiv | Model Zoo | HF Weights
Mission: TODO
- (🔥 New) [2025/3/22] 🔥 SANA-1.5 is supported in ComfyUI! See: ComfyUI Guidance | ComfyUI Work Flow SANA-1.5 4.8B
- (🔥 New) [2025/3/22] 🔥 SANA-Sprint code & weights are released! Training & inference code and Weights / HF are all released. [Guidance]
- (🔥 New) [2025/3/21] Sana + Inference Scaling is released. [Guidance]
- (🔥 New) [2025/3/16] 🔥 SANA-1.5 code & weights are released! DDP/FSDP | TAR file WebDataset | Multi-Scale Training code and Weights | HF are all released.
- (🔥 New) [2025/3/14] SANA-Sprint is coming out! A new one/few-step generator of Sana. 0.1s per 1024px image on H100, 0.3s on RTX 4090. Find out more details: [Page] | [Arxiv]. Code is coming very soon along with diffusers.
- (🔥 New) [2025/2/10] Sana + ControlNet is released. [Guidance] | [Model] | [Demo]
- (🔥 New) [2025/1/30] CAME-8bit optimizer code is released, saving more GPU memory during training. [How to config]
- (🔥 New) [2025/1/29] SANA 1.5 is out! Figure out how to do efficient training & inference scaling! [Tech Report]
- (🔥 New) [2025/1/24] 4bit-Sana is released, powered by SVDQuant and the Nunchaku inference engine. Now run your Sana within 8GB GPU VRAM. [Guidance] [Demo] [Model]
- (🔥 New) [2025/1/24] DCAE-1.1 is released, with better reconstruction quality. [Model] [diffusers]
- (🔥 New) [2025/1/23] Sana is accepted as an Oral by ICLR-2025.
- (🔥 New) [2025/1/12] DC-AE tiling lets Sana-4K generate 4096x4096px images within 22GB of GPU memory. With model offload and 8bit/4bit quantization, 4K Sana runs within 8GB of GPU VRAM. [Guidance]
- (🔥 New) [2025/1/11] Sana code-base license changed to Apache 2.0.
- (🔥 New) [2025/1/10] Sana inference with 8bit quantization is supported. [Guidance]
- (🔥 New) [2025/1/8] 4K resolution Sana models are supported in Sana-ComfyUI and a workflow is also prepared. [4K guidance]
- (🔥 New) [2025/1/8] 1.6B 4K resolution Sana models are released: [BF16 pth] or [BF16 diffusers]. Get your 4096x4096 resolution images within 20 seconds! Find more samples in Sana page. Thanks to SUPIR for their wonderful work and support.
- (🔥 New) [2025/1/2] Bug in the diffusers pipeline is solved. Solved PR
- (🔥 New) [2025/1/2] 2K resolution Sana models are supported in Sana-ComfyUI and a workflow is also prepared.
- ✅ [2024/12] 1.6B 2K resolution Sana models are released: [BF16 pth] or [BF16 diffusers]. Get your 2K resolution images within 4 seconds! Find more samples in Sana page. Thanks to SUPIR for their wonderful work and support.
- ✅ [2024/12] diffusers supports Sana-LoRA fine-tuning! Sana-LoRA's training and convergence speed is super fast. [Guidance] or [diffusers docs].
- ✅ [2024/12] diffusers has Sana! All Sana models in diffusers safetensors are released, and the diffusers pipelines SanaPipeline, SanaPAGPipeline, and DPMSolverMultistepScheduler (with FlowMatching) are all supported now. We prepare a Model Card for you to choose from.
- ✅ [2024/12] 1.6B BF16 Sana model is released for stable fine-tuning.
- ✅ [2024/12] We release the ComfyUI node for Sana. [Guidance]
- ✅ [2024/11] All multilingual (Emoji & Chinese & English) SFT models are released: 1.6B-512px, 1.6B-1024px, 600M-512px, 600M-1024px. The metric performance is shown here
- ✅ [2024/11] Sana Replicate API is launching at Sana-API.
- ✅ [2024/11] 1.6B Sana models are released.
- ✅ [2024/11] Training & Inference & Metrics code are released.
- ✅ [2024/11] Working on diffusers.
- [2024/10] Demo is released.
- [2024/10] DC-AE Code and weights are released!
- [2024/10] Paper is on Arxiv!
We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, and is deployable on a laptop GPU. Core designs include:
(1) DC-AE: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens.
(2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality.
(3) Decoder-only text encoder: we replaced T5 with a modern decoder-only small LLM as the text encoder and designed complex human instructions with in-context learning to enhance image-text alignment.
(4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence.
As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024 × 1024 resolution image. Sana enables content creation at low cost.
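To make the effect of the 32× compression in design (1) concrete, here is a minimal sketch that counts the latent tokens a diffusion transformer would process for a square image. The function name `latent_tokens` and the patch sizes (2 for an 8× autoencoder, 1 for the 32× DC-AE) are assumptions chosen for illustration; they are not stated in the text above.

```python
def latent_tokens(image_size: int, ae_factor: int, patch_size: int = 1) -> int:
    """Number of tokens for a square image.

    ae_factor is the autoencoder's spatial downsampling factor (8 or 32 here);
    patch_size is the transformer patch size (an assumption for this sketch).
    """
    side = image_size // (ae_factor * patch_size)
    return side * side


# Assumed settings: 8x AE with patch size 2 vs. 32x AE with patch size 1.
for size in (1024, 4096):
    baseline = latent_tokens(size, ae_factor=8, patch_size=2)
    dc_ae = latent_tokens(size, ae_factor=32, patch_size=1)
    print(f"{size}px: {baseline} tokens (8x AE, p=2) vs {dc_ae} tokens (32x AE, p=1)")
```

Under these assumptions the token count drops by a factor of 4 at every resolution, which matters because attention cost grows with the number of tokens.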
Methods (1024x1024) | Throughput (samples/s) | Latency (s) | Params (B) | Speedup | FID ↓ | CLIP ↑ | GenEval ↑ | DPG ↑ |
---|---|---|---|---|---|---|---|---|
FLUX-dev | 0.04 | 23.0 | 12.0 | 1.0× | 10.15 | 27.47 | 0.67 | 84.0 |
Sana-0.6B | 1.7 | 0.9 | 0.6 | 39.5× | 5.81 | 28.36 | 0.64 | 83.6 |
Sana-0.6B | 1.7 | 0.9 | 0.6 | 39.5× | 5.61 | 28.80 | 0.68 | 84.2 |
Sana-1.6B | 1.0 | 1.2 | 1.6 | 23.3× | 5.92 | 28.94 | 0.69 | 84.5 |
Sana-1.5 1.6B | 1.0 | 1.2 | 1.6 | 23.3× | 5.70 | 29.12 | 0.82 | 84.5 |
Sana-1.5 4.8B | 0.26 | 4.2 | 4.8 | 6.5× | 5.99 | 29.23 | 0.81 | 84.7 |
Methods | Throughput (samples/s) | Latency (s) | Params (B) | Speedup | FID ↓ | CLIP ↑ | GenEval ↑ | DPG ↑ |
---|---|---|---|---|---|---|---|---|
512 × 512 resolution | ||||||||
PixArt-α | 1.5 | 1.2 | 0.6 | 1.0× | 6.14 | 27.55 | 0.48 | 71.6 |
PixArt-Σ | 1.5 | 1.2 | 0.6 | 1.0× | 6.34 | 27.62 | 0.52 | 79.5 |
Sana-0.6B | 6.7 | 0.8 | 0.6 | 5.0× | 5.67 | 27.92 | 0.64 | 84.3 |
Sana-1.6B | 3.8 | 0.6 | 1.6 | 2.5× | 5.16 | 28.19 | 0.66 | 85.5 |
1024 × 1024 resolution | ||||||||
LUMINA-Next | 0.12 | 9.1 | 2.0 | 2.8× | 7.58 | 26.84 | 0.46 | 74.6 |
SDXL | 0.15 | 6.5 | 2.6 | 3.5× | 6.63 | 29.03 | 0.55 | 74.7 |
PlayGroundv2.5 | 0.21 | 5.3 | 2.6 | 4.9× | 6.09 | 29.13 | 0.56 | 75.5 |
Hunyuan-DiT | 0.05 | 18.2 | 1.5 | 1.2× | 6.54 | 28.19 | 0.63 | 78.9 |
PixArt-Σ | 0.4 | 2.7 | 0.6 | 9.3× | 6.15 | 28.26 | 0.54 | 80.5 |
DALLE3 | - | - | - | - | - | - | 0.67 | 83.5 |
SD3-medium | 0.28 | 4.4 | 2.0 | 6.5× | 11.92 | 27.83 | 0.62 | 84.1 |
FLUX-dev | 0.04 | 23.0 | 12.0 | 1.0× | 10.15 | 27.47 | 0.67 | 84.0 |
FLUX-schnell | 0.5 | 2.1 | 12.0 | 11.6× | 7.94 | 28.14 | 0.71 | 84.8 |
Sana-0.6B | 1.7 | 0.9 | 0.6 | 39.5× | 5.81 | 28.36 | 0.64 | 83.6 |
Sana-1.6B | 1.0 | 1.2 | 1.6 | 23.3× | 5.76 | 28.67 | 0.66 | 84.8 |
- Python >= 3.10.0 (Anaconda or Miniconda is recommended)
- PyTorch >= 2.0.1+cu12.1
git clone https://github.com/NVlabs/Sana.git
cd Sana
./environment_setup.sh sana
# or you can install each component step by step following environment_setup.sh
- 9GB VRAM is required for the 0.6B model and 12GB VRAM for the 1.6B model. Our later quantization version will require less than 8GB for inference.
- All the tests are done on A100 GPUs. Results may differ on other GPU models.
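As a rough pre-flight check against the VRAM figures above, the following sketch picks the largest variant whose stated requirement fits on the local GPU. The helper names are hypothetical and not part of the Sana code base; the thresholds are simply the 9GB and 12GB numbers quoted above, and a card's reported total memory is usually slightly below its nominal size, so treat the result as a guide only.

```python
from typing import Optional

# Inference VRAM requirements quoted above, in GB (measured on A100).
VRAM_REQUIRED_GB = {"0.6B": 9, "1.6B": 12}


def largest_model_that_fits(vram_gb: float) -> Optional[str]:
    """Return the largest Sana variant whose stated requirement fits, or None."""
    fitting = [name for name, need in VRAM_REQUIRED_GB.items() if vram_gb >= need]
    # Among the variants that fit, pick the one with the highest requirement.
    return max(fitting, key=VRAM_REQUIRED_GB.get, default=None)


def detected_vram_gb() -> Optional[float]:
    """Total memory of GPU 0 in GB, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    vram = detected_vram_gb()
    if vram is None:
        print("No CUDA GPU detected")
    else:
        print(f"{vram:.1f} GB detected -> {largest_model_that_fits(vram)}")
```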
Choose your model: Model card
Quick start with Gradio
# official online demo
DEMO_PORT=15432 \
python app/app_sana.py \
--share \
--config=configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
--model_path=hf://Efficient-Large-Model/Sana_1600M_1024px_BF16/checkpoints/Sana_1600M_1024px_BF16.pth \
--image_size=1024
Important
Upgrade to diffusers>=0.32.0.dev to make SanaPipeline and SanaPAGPipeline available!
pip install git+https://github.com/huggingface/diffusers
Make sure to load pipe.transformer with the default torch_dtype and variant listed in the Model Card. Set pipe.text_encoder to BF16 and pipe.vae to FP32 or BF16. For more info, docs are here.
# run `pip install git+https://github.com/huggingface/diffusers` before using Sana in diffusers
import torch
from diffusers import SanaPipeline
pipe = SanaPipeline.from_pretrained(
"Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers",
torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
pipe.vae.to(torch.bfloat16)
pipe.text_encoder.to(torch.bfloat16)
prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
image = pipe(
prompt=prompt,
height=1024,
width=1024,
guidance_scale=4.5,
num_inference_steps=20,
generator=torch.Generator(device="cuda").manual_seed(42),
)[0]
image[0].save("sana.png")
# run `pip install git+https://github.com/huggingface/diffusers` before using Sana in diffusers
import torch
from diffusers import SanaPAGPipeline
pipe = SanaPAGPipeline.from_pretrained(
"Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers",
torch_dtype=torch.bfloat16,
pag_applied_layers="transformer_blocks.8",
)
pipe.to("cuda")
pipe.text_encoder.to(torch.bfloat16)
pipe.vae.to(torch.bfloat16)
prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
image = pipe(
prompt=prompt,
guidance_scale=5.0,
pag_scale=2.0,
num_inference_steps=20,
generator=torch.Generator(device="cuda").manual_seed(42),
)[0]
image[0].save('sana.png')
import torch
from app.sana_pipeline import SanaPipeline
from torchvision.utils import save_image
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
generator = torch.Generator(device=device).manual_seed(42)
sana = SanaPipeline("configs/sana1-5_config/1024ms/Sana_1600M_1024px_allqknorm_bf16_lr2e5.yaml")
sana.from_pretrained("hf://Efficient-Large-Model/SANA1.5_1.6B_1024px/checkpoints/SANA1.5_1.6B_1024px.pth")
prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
image = sana(
prompt=prompt,
height=1024,
width=1024,
guidance_scale=4.5,
pag_guidance_scale=1.0,
num_inference_steps=20,
generator=generator,
)
save_image(image, 'output/sana.png', nrow=1, normalize=True, value_range=(-1, 1))
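The `normalize=True, value_range=(-1, 1)` arguments in the `save_image` call above rescale the pipeline output from [-1, 1] to [0, 1] before it is written as 8-bit pixels. The scalar sketch below mirrors that arithmetic as I understand torchvision's behavior (clamp, shift, scale, then round to 0..255); `to_uint8` is a hypothetical helper written for illustration, not a torchvision function.

```python
def to_uint8(value: float, low: float = -1.0, high: float = 1.0) -> int:
    """Map one pixel value from [low, high] to an integer in 0..255."""
    # Clamp into the declared value range, then rescale to [0, 1].
    clamped = min(max(value, low), high)
    unit = (clamped - low) / max(high - low, 1e-5)
    # Scale to 8 bits with round-half-up, clamped to the valid byte range.
    return int(min(max(unit * 255 + 0.5, 0), 255))


for v in (-1.0, 0.0, 1.0, 2.0):
    print(v, "->", to_uint8(v))
```

Values outside the declared range are clipped, so passing the wrong `value_range` tends to produce washed-out or saturated images.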
# Pull related models
huggingface-cli download google/gemma-2b-it
huggingface-cli download google/shieldgemma-2b
huggingface-cli download mit-han-lab/dc-ae-f32c32-sana-1.1
huggingface-cli download Efficient-Large-Model/Sana_1600M_1024px
# Run with docker
docker build . -t sana
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v ~/.cache:/root/.cache \
sana
# Run samples in a txt file
python scripts/inference.py \
--config=configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
--model_path=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth \
--txt_file=asset/samples/samples_mini.txt
# Run samples in a json file
python scripts/inference.py \
--config=configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
--model_path=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth \
--json_file=asset/samples/samples_mini.json
where each line of asset/samples/samples_mini.txt contains a prompt to generate.
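If you want to build your own prompt file in the TXT format described above (one prompt per line), a minimal sketch could look like this. The helper name `write_prompt_file` and the file name used in the note below are hypothetical; the only format fact taken from the text is that each line holds one prompt.

```python
from pathlib import Path


def write_prompt_file(prompts, path):
    """Write one prompt per line and return the number of prompts written.

    Blank prompts are skipped and internal newlines are flattened to spaces,
    so every prompt occupies exactly one line.
    """
    cleaned = [" ".join(p.split()) for p in prompts if p.strip()]
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(line + "\n" for line in cleaned), encoding="utf-8")
    return len(cleaned)
```

For example, `write_prompt_file(['a cyberpunk cat with a neon sign that says "Sana"'], "my_prompts.txt")` produces a file that can be passed to scripts/inference.py through `--txt_file=my_prompts.txt`.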
- 32GB VRAM is required for training both the 0.6B and 1.6B models
We provide a training example here and you can also select your desired config file from config files dir based on your data structure.
To launch Sana training, you will first need to prepare data in the following formats. Here is an example for the data structure for reference.
asset/example_data
├── AAA.txt
├── AAA.png
├── BCC.txt
├── BCC.png
├── ......
├── CCC.txt
└── CCC.png
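Before launching training it can help to confirm that every image in the folder has a caption file with the same stem, as in the layout above. The following is a small sketch of such a check; `find_unpaired` is a hypothetical helper, and it only looks for `.png` images because that is the only extension shown in the example.

```python
from pathlib import Path


def find_unpaired(data_dir, image_exts=(".png",)):
    """Return (images_without_caption, captions_without_image) as sorted stems."""
    files = list(Path(data_dir).iterdir())
    images = {p.stem for p in files if p.suffix.lower() in image_exts}
    captions = {p.stem for p in files if p.suffix.lower() == ".txt"}
    return sorted(images - captions), sorted(captions - images)
```

Calling `find_unpaired("asset/example_data")` should return two empty lists when the folder follows the structure shown above.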
Then Sana's training can be launched via
# Example of training Sana 0.6B with 512x512 resolution from scratch
bash train_scripts/train.sh \
configs/sana_config/512ms/Sana_600M_img512.yaml \
--data.data_dir="[asset/example_data]" \
--data.type=SanaImgDataset \
--model.multi_scale=false \
--train.train_batch_size=32
# Example of fine-tuning Sana 1.6B with 1024x1024 resolution
bash train_scripts/train.sh \
configs/sana_config/1024ms/Sana_1600M_img1024.yaml \
--data.data_dir="[asset/example_data]" \
--data.type=SanaImgDataset \
--model.load_from=hf://Efficient-Large-Model/Sana_1600M_1024px/checkpoints/Sana_1600M_1024px.pth \
--model.multi_scale=false \
--train.train_batch_size=8
We also provide conversion scripts to convert your data to the required format. You can refer to the data conversion scripts for more details.
python tools/convert_ImgDataset_to_WebDatasetMS_format.py
Then Sana's training can be launched via
# Example of training Sana 0.6B with 512x512 resolution from scratch
bash train_scripts/train.sh \
configs/sana_config/512ms/Sana_600M_img512.yaml \
--data.data_dir="[asset/example_data_tar]" \
--data.type=SanaWebDatasetMS \
--model.multi_scale=true \
--train.train_batch_size=32
We prepared a toy TAR dataset containing 100 random images from Journey-DB, duplicated for testing purposes. Note that this dataset is not intended for training.
huggingface-cli download Efficient-Large-Model/toy_data --repo-type dataset --local-dir ./data/toy_data --local-dir-use-symlinks False
Then, you are ready to run with FSDP or DDP:
# DDP
# Example of training Sana 1.6B with 1024x1024 resolution from scratch
bash train_scripts/train.sh \
configs/sana1-5_config/1024ms/Sana_1600M_1024px_allqknorm_bf16_lr2e5.yaml \
--data.data_dir="[data/toy_data]" \
--data.type=SanaWebDatasetMS \
--model.multi_scale=true \
--data.load_vae_feat=true \
--train.train_batch_size=2
# FSDP
# Example of training Sana 1.6B with 1024x1024 resolution from scratch
bash train_scripts/train.sh \
configs/sana1-5_config/1024ms/Sana_1600M_1024px_AdamW_fsdp.yaml \
--data.data_dir="[data/toy_data]" \
--data.type=SanaWebDatasetMS \
--model.multi_scale=true \
--data.load_vae_feat=true \
--train.use_fsdp=true \
--train.train_batch_size=2
Refer to Toolkit Manual.
We trained a specialized NVILA-2B model to score images, which we named VISA (VIla as SAna verifier). By selecting the top 4 images from 2,048 candidates, we enhanced the GenEval performance of SD1.5 and SANA-1.5-4.8B v2, increasing their overall scores from 0.42 to 0.87 and from 0.81 to 0.96, respectively. For details, refer to the Inference Scaling Manual.
Method | Overall | Single | Two | Counting | Colors | Position | Color Attribution |
---|---|---|---|---|---|---|---|
SD1.5 | 0.42 | 0.98 | 0.39 | 0.31 | 0.72 | 0.04 | 0.06 |
+ Inference Scaling | 0.87 | 1.00 | 0.97 | 0.93 | 0.96 | 0.75 | 0.62 |
SANA-1.5 4.8B v2 | 0.81 | 0.99 | 0.86 | 0.86 | 0.84 | 0.59 | 0.65 |
+ Inference Scaling | 0.96 | 1.00 | 1.00 | 0.97 | 0.94 | 0.96 | 0.87 |
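The selection step behind the "+ Inference Scaling" rows (generate many candidates, score each with a verifier, keep the best few) can be sketched as follows. This is a generic best-of-N illustration, not the repository's implementation: `select_top_k` and `score_fn` are hypothetical names, and in the demo a seeded random table stands in for the VISA verifier's scores.

```python
import heapq
import random


def select_top_k(candidates, score_fn, k=4):
    """Keep the k candidates with the highest verifier score, best first."""
    return heapq.nlargest(k, candidates, key=score_fn)


if __name__ == "__main__":
    # Stand-ins: integers play the role of generated images, and a seeded
    # random table plays the role of the verifier's scores.
    rng = random.Random(0)
    fake_scores = {i: rng.random() for i in range(2048)}
    best = select_top_k(range(2048), fake_scores.get, k=4)
    print(best)
```

In a real setup, `candidates` would be images generated from the same prompt with different seeds, and `score_fn` would call the verifier model on each image and prompt pair.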
Our SANA-Sprint models focus on timestep distillation, achieving high-quality generation with 1-4 inference steps. Refer to SANA-Sprint Manual for more details.
We will try our best to achieve:
- [✅] Training code
- [✅] Inference code
- [✅] Model zoo
- [✅] ComfyUI
- [✅] DC-AE Diffusers
- [✅] Sana merged in Diffusers (huggingface/diffusers#9982)
- [✅] LoRA training by @paul (diffusers: huggingface/diffusers#10234)
- [✅] 2K/4K resolution models. (Thanks to @SUPIR for providing a 4K super-resolution model)
- [✅] 8bit / 4bit Laptop development
- [✅] ControlNet (train & inference & models)
- [✅] FSDP Training
- [✅] SANA-1.5 (Larger model size / Inference Scaling)
- [✅] SANA-Sprint: Few-step generator
- [💻] Better re-construction F32/F64 VAEs.
- [ ] Video Generation
Thanks to the following open-source codebases for their wonderful work!
Thanks go to these wonderful contributors:
@misc{xie2024sana,
title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Haotian Tang and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
year={2024},
eprint={2410.10629},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.10629},
}
@misc{xie2025sana,
title={SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer},
author={Xie, Enze and Chen, Junsong and Zhao, Yuyang and Yu, Jincheng and Zhu, Ligeng and Lin, Yujun and Zhang, Zhekai and Li, Muyang and Chen, Junyu and Cai, Han and others},
year={2025},
eprint={2501.18427},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.18427},
}
@misc{chen2025sanasprint,
title={SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation},
author={Junsong Chen and Shuchen Xue and Yuyang Zhao and Jincheng Yu and Sayak Paul and Junyu Chen and Han Cai and Enze Xie and Song Han},
year={2025},
eprint={2503.09641},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.09641},
}