Denoising diffusion models exhibit remarkable generative capabilities, but they remain challenging to train: their inherent stochasticity produces high-variance gradient estimates and slow convergence. Previous work has shown that magnitude preservation helps stabilize training in the U-Net architecture. This work explores whether that effect extends to the Diffusion Transformer (DiT) architecture, and proposes a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, a novel conditioning method that uses learned rotations instead of the traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, yielding a notable reduction in FID scores.
Fig 1. DiT-S/4 samples without (left) and with (right) magnitude-preserving layers.
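To give a concrete feel for rotation modulation, here is a minimal PyTorch-style sketch: the conditioning embedding is mapped to one angle per channel pair, and each pair of features is rotated by that angle. The class name, shapes, and parameterization below are illustrative assumptions, not the repo's actual modules; the point is that rotations are orthogonal, so the conditioning step leaves activation magnitudes unchanged, unlike scale/shift modulation.

```python
import torch
import torch.nn as nn

class RotationModulation(nn.Module):
    """Illustrative sketch: condition features by rotating channel pairs."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        assert dim % 2 == 0, "feature dimension must be even to form pairs"
        # Map the conditioning embedding to one rotation angle per channel pair.
        self.to_angles = nn.Linear(cond_dim, dim // 2)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), cond: (batch, cond_dim)
        theta = self.to_angles(cond).unsqueeze(1)           # (batch, 1, dim/2)
        cos, sin = torch.cos(theta), torch.sin(theta)
        x1, x2 = x[..., 0::2], x[..., 1::2]                 # split channels into pairs
        y1 = x1 * cos - x2 * sin                            # 2-D rotation of each pair
        y2 = x1 * sin + x2 * cos
        return torch.stack((y1, y2), dim=-1).flatten(-2)    # re-interleave the pairs
```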
This project builds upon key concepts from the following research papers:
- Peebles & Xie (2023) explore the application of transformer architectures to diffusion models, achieving state-of-the-art performance on various generation tasks;
- Karras et al. (2024) introduce the idea of preserving the magnitude of features during the diffusion process, enhancing the stability and quality of generated outputs.
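As a rough illustration of the magnitude-preservation idea from Karras et al. (2024), the sketch below shows a weight-normalized linear layer: the weight rows are renormalized during training so they cannot grow, and the forward pass uses unit-norm rows so that roughly unit-variance inputs yield unit-variance outputs. This is a simplified sketch under those assumptions, not the repo's implementation.

```python
import torch
import torch.nn as nn

class MPLinear(nn.Module):
    """Simplified sketch of a magnitude-preserving linear layer."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    @staticmethod
    def _normalize(w: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
        # Normalize each output row to unit norm.
        return w / (w.norm(dim=1, keepdim=True) + eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            with torch.no_grad():
                # "Forced" weight normalization: keep the stored weights from growing.
                self.weight.copy_(self._normalize(self.weight))
        w = self._normalize(self.weight)   # unit-norm rows preserve activation variance
        return x @ w.t()
```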
We're actively developing this repo. Contributions and feedback are welcome!
```bash
python train.py --data-path /path/to/data --results-dir /path/to/results --model DiT-S/2 --num-steps 400_000
```

Customize the training process by enabling the following flags:
- `--use-cosine-attention`: Controls weight growth in attention layers.
- `--use-weight-normalization`: Applies magnitude preservation in linear layers.
- `--use-forced-weight-normalization`: Controls weight growth in linear layers.
- `--use-mp-residual`: Enables magnitude preservation in residual connections.
- `--use-mp-silu`: Uses a magnitude-preserving version of the SiLU nonlinearity (sketched after this list).
- `--use-no-layernorm`: Disables transformer layer normalization.
- `--use-mp-pos-enc`: Activates magnitude-preserving positional encoding.
- `--use-mp-embedding`: Uses magnitude-preserving embeddings.
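The `--use-mp-silu` and `--use-mp-residual` flags correspond to magnitude-preserving primitives in the style of Karras et al. (2024). The snippet below sketches both; the function names and the blend weight `t` are illustrative, not the repo's exact API.

```python
import torch
import torch.nn.functional as F

def mp_silu(x: torch.Tensor) -> torch.Tensor:
    # SiLU rescaled so that a unit-variance Gaussian input keeps roughly unit
    # variance at the output (the 0.596 constant follows Karras et al., 2024).
    return F.silu(x) / 0.596

def mp_sum(a: torch.Tensor, b: torch.Tensor, t: float = 0.3) -> torch.Tensor:
    # Magnitude-preserving residual connection: blend the skip path `a` with the
    # residual branch `b`, then renormalize so the expected magnitude stays
    # unchanged (assuming a and b are uncorrelated with similar magnitudes).
    return ((1.0 - t) * a + t * b) / ((1.0 - t) ** 2 + t ** 2) ** 0.5
```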
```bash
python sample.py --result-dir /path/to/results/<dir> --class-label <class label>
```

```bibtex
@misc{bill2025exploringmagnitudepreservationrotation,
title={Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers},
author={Eric Tillman Bill and Cristian Perez Jensen and Sotiris Anagnostidis and Dimitri von Rütte},
year={2025},
eprint={2505.19122},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.19122},
}
```