Hi, thank you for this excellent work! The insight that x-prediction is fundamentally more tractable for high-dimensional diffusion due to the manifold assumption is both elegant and practically impactful.
I'm very interested in reproducing the ablation experiments shown in Table 2 of the paper, which compares all 9 combinations of prediction space (x-pred, ε-pred, v-pred) and loss space (x-loss, ε-loss, v-loss) on ImageNet 256×256 with JiT-B/16.
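For concreteness, here is a minimal sketch of how I currently understand the 9 combinations. It rests on my own assumptions rather than the official code: a linear interpolation `z_t = t * x + (1 - t) * eps` with `v = x - eps`, and a plain unweighted MSE in the chosen loss space. The names `convert`, `combo_loss`, and `SPACES` are hypothetical helpers, not taken from the repository. Please correct me if the ablations were set up differently.

```python
import numpy as np

SPACES = ("x", "eps", "v")

def convert(pred, z_t, t, pred_space):
    """Map a network output in `pred_space` to the implied (x, eps, v).

    Assumes z_t = t * x + (1 - t) * eps and v = x - eps, with 0 < t < 1.
    """
    if pred_space == "x":
        x = pred
        eps = (z_t - t * x) / (1.0 - t)
    elif pred_space == "eps":
        eps = pred
        x = (z_t - (1.0 - t) * eps) / t
    elif pred_space == "v":
        # Solving the two assumed equations for x and eps given v.
        x = z_t + (1.0 - t) * pred
        eps = z_t - t * pred
    else:
        raise ValueError(f"unknown prediction space: {pred_space}")
    return {"x": x, "eps": eps, "v": x - eps}

def combo_loss(pred, x, eps, t, pred_space, loss_space):
    """Unweighted MSE in `loss_space` for an output in `pred_space`."""
    z_t = t * x + (1.0 - t) * eps
    target = {"x": x, "eps": eps, "v": x - eps}[loss_space]
    implied = convert(pred, z_t, t, pred_space)[loss_space]
    return float(np.mean((implied - target) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    eps = rng.normal(size=(4, 8))
    pred = x + 0.1 * rng.normal(size=x.shape)  # stand-in for an imperfect x-prediction
    for loss_space in SPACES:
        print("x-pred,", loss_space + "-loss:", combo_loss(pred, x, eps, 0.3, "x", loss_space))
```

If this matches the intended setup, the remaining unknowns for reproduction would be the per-combination hyperparameters and any loss weighting or clipping near `t = 0` and `t = 1`, where the conversions above divide by small numbers.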
Currently, the official implementation only provides the final model variants (x-prediction with v-loss). To facilitate further research and reproducibility, would it be possible to provide either:
- Pre-trained checkpoints for the other 8 combinations in Table 2 (especially the ε-pred and v-pred variants that show catastrophic failure), or
- Detailed training configurations / scripts that were used to run these ablation experiments?
Having access to these would be extremely valuable for:
- Reproducing the key empirical findings of the paper
- Conducting further analysis on why ε/v-prediction fails in high-dimensional settings
- Building upon this work for other domains (e.g., robotics, scientific data)
Thank you for open-sourcing your work, and I appreciate any guidance you can provide!