
📣 [Feature Update] ✨ REAL DDIM Inversion ✨ is now possible on CogVideoX! #783

@yesiltepe-hidir

It is well known that applying DDIM inversion in CogVideoX and attempting to reconstruct from the inverted latent often leads to results with high saturation and a washed-out appearance.

[video: k_rnr_reconstruction.mp4]

⏳ Background

To solve this inverse problem, a ddim_inversion.py script was recently shared in the CogVideoX repository.

However, this implementation takes a non-standard approach. Instead of directly using the inverted latent as the initial noise for reconstruction, it employs the inverted latent as a reference for the KV caching mechanism.

Specifically, at each timestep and for every DiT layer, the model performs two separate attention computations:

  1. One attention pass over the concatenation of the current noisy latent and the reference latent (key and value concatenated with key_reference and value_reference)
  2. A second pass using only the reference latent, whose output is stored for attention sharing in the next layer.
    (please refer to the corresponding lines)
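The two-pass scheme above can be sketched as follows. This is a simplified single-head NumPy illustration; the function names, and the omission of batching, multiple heads, and projections, are my own simplifications, not the repository's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def reference_guided_attention(q, k, v, q_ref, k_ref, v_ref):
    # Pass 1: current tokens attend over their own KV concatenated
    # with the reference latent's KV.
    out = attention(q, np.concatenate([k, k_ref]), np.concatenate([v, v_ref]))
    # Pass 2: reference-only attention, whose result would be cached
    # for attention sharing in the next layer.
    ref_cache = attention(q_ref, k_ref, v_ref)
    return out, ref_cache
```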

✨ Simple and Efficient Solution

In our new paper, Dynamic View Synthesis as an Inverse Problem, we first address this inverse problem.

As a result of our work, one can simply invert & reconstruct a real video using the following steps:

Inversion Steps

  1. Invert the source video using DDIMInverseScheduler
  2. Save only the inverted latent (Let's call it latents)
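Schematically, each inversion step runs the deterministic DDIM update in reverse along the noise schedule. Below is a minimal NumPy sketch of a single inverted step (my own illustration, not the scheduler's code), assuming for simplicity the same noise prediction eps in both directions; in practice the network is re-evaluated at each step, which is where inversion error comes from:

```python
import numpy as np

def ddim_inverse_step(x_t, eps, abar_t, abar_next):
    # Predict x0 from the current latent and the noise prediction,
    # then move *up* the noise schedule (abar decreases as t grows).
    x0 = (x_t - np.sqrt(1 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_next) * x0 + np.sqrt(1 - abar_next) * eps

def ddim_step(x_t, eps, abar_t, abar_prev):
    # Ordinary deterministic DDIM denoising update, for comparison.
    x0 = (x_t - np.sqrt(1 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0 + np.sqrt(1 - abar_prev) * eps
```

With a fixed eps the two updates are exact inverses of each other; the saturation artifacts discussed above arise because the real eps prediction differs between the inversion and reconstruction passes.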

Reconstruction Steps

  1. Encode the source video; an example implementation:
init_latents = [retrieve_latents(self.vae.encode(vid.unsqueeze(0)), generator) for vid in video]
  2. Then apply our proposal, K-RNR, in prepare_latents:
k = 3  # see the paper for why the value 3 is optimal
for i in range(k):
    # note: diffusers' add_noise also takes the timesteps argument
    latents = self.scheduler.add_noise(init_latents, latents, timesteps)
return latents
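For illustration, the K-RNR recursion can also be written standalone. This NumPy sketch assumes add_noise follows the usual forward-diffusion form sqrt(ᾱ)·clean + sqrt(1−ᾱ)·noise at a single timestep; the function names here are mine, not the paper's:

```python
import numpy as np

def add_noise(clean, noise, abar):
    # Mirrors a scheduler's add_noise at one timestep with
    # cumulative alpha abar: sqrt(abar)*clean + sqrt(1-abar)*noise.
    return np.sqrt(abar) * clean + np.sqrt(1 - abar) * noise

def k_rnr(init_latents, inverted_latents, abar, k=3):
    # K-RNR: recursively re-noise the clean (VAE-encoded) latents,
    # feeding the previous iterate back in as the "noise" term.
    latents = inverted_latents
    for _ in range(k):
        latents = add_noise(init_latents, latents, abar)
    return latents
```

Each round pulls the iterate toward the clean encoding while shrinking the contribution of the original inverted latent by a factor of sqrt(1−ᾱ), which is why a small k already suffices.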

One can use the resulting latents as the input to the transformer to obtain sharp reconstructions in a training-free and highly efficient manner. More video examples can be found in our supplementary videos.

If you use K-RNR, please cite us:

@article{yesiltepe2025dynamic,
  title={Dynamic View Synthesis as an Inverse Problem},
  author={Yesiltepe, Hidir and Yanardag, Pinar},
  journal={arXiv preprint arXiv:2506.08004},
  year={2025}
}
