Hi, thank you for releasing the code and the inspiring paper!
I’ve been experimenting with applying JiT-style x-prediction with a velocity loss to text-to-motion diffusion models, and I ran into an interesting behavior around clipping (1 - t) during sampling.
I wrote up the details in this blog post:
In short: clipping (1 - t) is not necessary during sampling with the 50-step Heun2 sampler. Applying the clipping only during training remains stable and, in my experiments, actually improves sample quality (FID) for text-to-motion diffusion.
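To make the proposed change concrete, here is a minimal sketch of what I mean. This is illustrative only, assuming an x-prediction model where the velocity is recovered as v = (x_t - x_hat) / (1 - t); the function name, `clip_one_minus_t` flag, and `eps` value are my own placeholders, not names from this repo:

```python
def velocity(x_t, x_hat, t, clip_one_minus_t=True, eps=5e-2):
    """Convert an x-prediction x_hat into a velocity estimate at time t.

    clip_one_minus_t=True  -> training-style: clamp (1 - t) away from
                              zero so the loss stays stable near t = 1.
    clip_one_minus_t=False -> sampling-style (the modification I tried):
                              use the raw (1 - t) with no clipping.
    """
    one_minus_t = 1.0 - t
    if clip_one_minus_t:
        # Keep the denominator bounded away from zero during training.
        one_minus_t = max(one_minus_t, eps)
    return (x_t - x_hat) / one_minus_t
```

So the only difference between the two regimes is whether the denominator is clamped; during sampling I pass the raw (1 - t) through unchanged, and the Heun2 integrator handles it without instabilities in my runs.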
I have only verified this behavior on text-to-motion diffusion, but I suspect the same modification could help image diffusion models as well. I wanted to share the observation in case it is useful here; it would be great if you could verify it on your end.
Thanks again for the great work!