Description
It is well known that applying DDIM inversion in CogVideoX and attempting to reconstruct from the inverted latent often leads to results with high saturation and a washed-out appearance.
(Video attachment: k_rnr_reconstruction.mp4)
⏳ Background
To solve this inverse problem, a ddim_inversion.py script was recently shared in the CogVideoX repository.
However, this implementation takes a non-standard approach. Instead of directly using the inverted latent as the initial noise for reconstruction, it employs the inverted latent as a reference for the KV caching mechanism.
Specifically, at each timestep and for every DiT layer, the model performs two separate attention computations:
- One attention pass over the concatenation of the current latent's keys and values with those of the reference latent (key, value with key_reference, value_reference).
- A second pass using only the reference latent, which is stored for attention sharing in the next layer.
(please refer to corresponding lines)
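The two passes described above can be sketched as follows. This is a minimal, dependency-free illustration of the idea, not the code from ddim_inversion.py: the function names `attention` and `layer_attention` are hypothetical, the attention is single-head, and plain Python lists stand in for tensors.

```python
import math


def attention(q, k, v):
    """Single-head scaled dot-product attention on lists of row vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)  # subtract the max for a numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def layer_attention(q, k, v, q_ref, k_ref, v_ref):
    """Hypothetical per-layer routine mirroring the two passes in the text."""
    # Pass 1: the current latent attends over its own keys/values
    # concatenated with the reference keys/values.
    out = attention(q, k + k_ref, v + v_ref)
    # Pass 2: the reference latent attends only to itself; this output is
    # what would be carried forward as the reference for the next layer.
    out_ref = attention(q_ref, k_ref, v_ref)
    return out, out_ref
```

Because every layer runs both passes at every timestep, the attention cost is roughly doubled compared with a plain forward pass.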
✨ Simple and Efficient Solution
In our new paper, "Dynamic View Synthesis as an Inverse Problem", we first focus on this inverse problem.
As a result of our work, one can simply invert & reconstruct a real video using the following steps:
Inversion Steps
- Invert the source video using DDIMInverseScheduler
- Save only the inverted latent (let's call it latents)
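For readers unfamiliar with what a DDIM inversion loop does, here is a minimal sketch under stated assumptions. It is not the DDIMInverseScheduler implementation: it assumes an epsilon-prediction model (hypothetical `eps_model` callable), works on plain floats, and takes a hand-written list of cumulative alpha values, whereas the real pipeline operates on video latents with its own prediction type and schedule.

```python
import math


def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update moving x from noise level a_from to a_to.

    a_from and a_to are cumulative alpha products (1.0 means clean).
    """
    x0_pred = (x - math.sqrt(1.0 - a_from) * eps) / math.sqrt(a_from)
    return math.sqrt(a_to) * x0_pred + math.sqrt(1.0 - a_to) * eps


def invert(x0, eps_model, alphas_bar):
    """Walk from clean to noisy; alphas_bar is ordered from ~1 down to ~0."""
    x = x0
    for a_from, a_to in zip(alphas_bar[:-1], alphas_bar[1:]):
        x = ddim_step(x, eps_model(x, a_from), a_from, a_to)
    return x


def reconstruct(x_noisy, eps_model, alphas_bar):
    """Walk the same schedule in reverse, from noisy back to clean."""
    x = x_noisy
    rev = alphas_bar[::-1]
    for a_from, a_to in zip(rev[:-1], rev[1:]):
        x = ddim_step(x, eps_model(x, a_from), a_from, a_to)
    return x


# Toy usage: a constant noise predictor makes the round trip exact.
schedule = [1.0, 0.9, 0.5, 0.1]
toy_model = lambda x, a: 0.3
inverted = invert(2.0, toy_model, schedule)
recovered = reconstruct(inverted, toy_model, schedule)
```

With a real network the prediction depends on the current latent, so the inversion and reconstruction trajectories no longer match exactly, and the round trip is only approximate.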
Reconstruction Steps
- Encode the source video. Example implementation:
    init_latents = [retrieve_latents(self.vae.encode(vid.unsqueeze(0)), generator) for vid in video]
- Then apply our proposal, K-RNR, in prepare_latents:
    k = 3  # see the paper for why the value 3 is optimal
    for i in range(k):
        latents = self.scheduler.add_noise(init_latents, latents)
    return latents

One can use the resulting latents as input to the transformer block to obtain sharp reconstructions in a training-free and very efficient manner. More video examples can be found in our supplementary videos.
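To make the effect of the K-RNR loop concrete, here is a small standalone sketch. It assumes that add_noise follows the usual forward-diffusion mix, sqrt(alpha_bar) * original + sqrt(1 - alpha_bar) * noise, at a single fixed noise level; the `alpha_bar` argument and the `k_rnr` helper name are assumptions for illustration, since the snippet above does not show which timestep is passed to the scheduler.

```python
import math


def add_noise(original, noise, alpha_bar):
    """Standard forward-diffusion mix (assumed form of scheduler.add_noise)."""
    return math.sqrt(alpha_bar) * original + math.sqrt(1.0 - alpha_bar) * noise


def k_rnr(init_latents, inverted_latents, alpha_bar, k=3):
    """Re-noise the encoded source k times, starting from the inverted latent.

    Uses only * and +, so it works on floats here and would equally apply
    elementwise to NumPy arrays or torch tensors.
    """
    latents = inverted_latents
    for _ in range(k):
        # The previous result plays the role of the "noise" in the next pass.
        latents = add_noise(init_latents, latents, alpha_bar)
    return latents
```

Under this assumption, each repetition scales the contribution of the inverted latent by sqrt(1 - alpha_bar) and adds another share of the encoded source, so after k passes the result equals sqrt(alpha_bar) * init_latents * (1 + s + ... + s^(k-1)) + s^k * inverted_latents, with s = sqrt(1 - alpha_bar).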
If you use K-RNR, cite us:
@article{yesiltepe2025dynamic,
title={Dynamic View Synthesis as an Inverse Problem},
author={Yesiltepe, Hidir and Yanardag, Pinar},
journal={arXiv preprint arXiv:2506.08004},
year={2025}
}