Hi, thank you for releasing the code and the inspiring paper!
I’ve been experimenting with applying JiT-style x-prediction with a velocity loss to text-to-motion diffusion models, and I ran into an interesting behavior around clipping (1 - t) during sampling.
I wrote up the details in this blog post:
In short: clipping (1 - t) is not necessary during sampling with the 50-step Heun2 sampler. Applying the clipping only during training remains stable and, in my experiments, actually improves sample quality (FID) for text-to-motion diffusion.
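To make the proposed change concrete, here is a minimal sketch of what I mean. This is illustrative only, assuming an x-prediction model where the velocity is recovered as v = (x_t - x_hat) / (1 - t); the function name, `clip_one_minus_t` flag, and `eps` value are my own placeholders, not names from this repo:

```python
def velocity(x_t, x_hat, t, clip_one_minus_t=True, eps=5e-2):
    """Convert an x-prediction x_hat into a velocity estimate at time t.

    clip_one_minus_t=True  -> training-style: clamp (1 - t) away from
                              zero so the loss stays stable near t = 1.
    clip_one_minus_t=False -> sampling-style (the modification I tried):
                              use the raw (1 - t) with no clipping.
    """
    one_minus_t = 1.0 - t
    if clip_one_minus_t:
        # Keep the denominator bounded away from zero during training.
        one_minus_t = max(one_minus_t, eps)
    return (x_t - x_hat) / one_minus_t
```

So the only difference between the two regimes is whether the denominator is clamped; during sampling I pass the raw (1 - t) through unchanged, and the Heun2 integrator handles it without instabilities in my runs.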
I have only verified this behavior on text-to-motion diffusion, but I suspect the same modification could help image diffusion models as well. I wanted to share the observation in case it is useful here; it would be great if you could verify it on your end.
Thanks again for the great work!