
📣 [Feature Update] ✨ REAL DDIM Inversion ✨ is now possible on CogVideoX! #783

@yesiltepe-hidir

It is well known that applying DDIM inversion in CogVideoX and attempting to reconstruct from the inverted latent often leads to results with high saturation and a washed-out appearance.

[video: k_rnr_reconstruction.mp4]

⏳ Background

To solve this inverse problem, a ddim_inversion.py script was recently shared in the CogVideoX repository.

However, this implementation takes a non-standard approach. Instead of directly using the inverted latent as the initial noise for reconstruction, it employs the inverted latent as a reference for the KV caching mechanism.

Specifically, at each timestep and for every DiT layer, the model performs two separate attention computations:

  1. One attention pass over the concatenation of the current noisy latent and the reference latent (key and value concatenated with key_reference and value_reference)
  2. A second pass using only the reference latent, whose output is stored for attention sharing in the next layer.
    (please refer to the corresponding lines)
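The two-pass scheme above can be sketched as follows. This is a simplified single-head NumPy illustration; the function names, and the omission of batching, multiple heads, and projections, are my own simplifications, not the repository's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def reference_guided_attention(q, k, v, q_ref, k_ref, v_ref):
    # Pass 1: current tokens attend over their own KV concatenated
    # with the reference latent's KV.
    out = attention(q, np.concatenate([k, k_ref]), np.concatenate([v, v_ref]))
    # Pass 2: reference-only attention, whose result would be cached
    # for attention sharing in the next layer.
    ref_cache = attention(q_ref, k_ref, v_ref)
    return out, ref_cache
```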

✨ Simple and Efficient Solution

In our new paper, Dynamic View Synthesis as an Inverse Problem, we first address this inverse problem.

As a result of our work, one can simply invert & reconstruct a real video using the following steps:

Inversion Steps

  1. Invert the source video using DDIMInverseScheduler
  2. Save only the inverted latent (Let's call it latents)
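Schematically, each inversion step runs the deterministic DDIM update in reverse along the noise schedule. Below is a minimal NumPy sketch of a single inverted step (my own illustration, not the scheduler's code), assuming for simplicity the same noise prediction eps in both directions; in practice the network is re-evaluated at each step, which is where inversion error comes from:

```python
import numpy as np

def ddim_inverse_step(x_t, eps, abar_t, abar_next):
    # Predict x0 from the current latent and the noise prediction,
    # then move *up* the noise schedule (abar decreases as t grows).
    x0 = (x_t - np.sqrt(1 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_next) * x0 + np.sqrt(1 - abar_next) * eps

def ddim_step(x_t, eps, abar_t, abar_prev):
    # Ordinary deterministic DDIM denoising update, for comparison.
    x0 = (x_t - np.sqrt(1 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0 + np.sqrt(1 - abar_prev) * eps
```

With a fixed eps the two updates are exact inverses of each other; the saturation artifacts discussed above arise because the real eps prediction differs between the inversion and reconstruction passes.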

Reconstruction Steps

  1. Encode the source video; an example implementation:
init_latents = [retrieve_latents(self.vae.encode(vid.unsqueeze(0)), generator) for vid in video]
  2. Then apply our proposal, K-RNR, in prepare_latents:
k = 3  # see the paper for why the value 3 is optimal
for i in range(k):
    # note: diffusers' add_noise also takes the timesteps argument
    latents = self.scheduler.add_noise(init_latents, latents, timesteps)
return latents
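For illustration, the K-RNR recursion can also be written standalone. This NumPy sketch assumes add_noise follows the usual forward-diffusion form sqrt(ᾱ)·clean + sqrt(1−ᾱ)·noise at a single timestep; the function names here are mine, not the paper's:

```python
import numpy as np

def add_noise(clean, noise, abar):
    # Mirrors a scheduler's add_noise at one timestep with
    # cumulative alpha abar: sqrt(abar)*clean + sqrt(1-abar)*noise.
    return np.sqrt(abar) * clean + np.sqrt(1 - abar) * noise

def k_rnr(init_latents, inverted_latents, abar, k=3):
    # K-RNR: recursively re-noise the clean (VAE-encoded) latents,
    # feeding the previous iterate back in as the "noise" term.
    latents = inverted_latents
    for _ in range(k):
        latents = add_noise(init_latents, latents, abar)
    return latents
```

Each round pulls the iterate toward the clean encoding while shrinking the contribution of the original inverted latent by a factor of sqrt(1−ᾱ), which is why a small k already suffices.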

One can use the resulting latents as the input to the transformer to obtain sharp reconstructions in a training-free and highly efficient manner. More video examples can be found in our supplementary videos.

If you use K-RNR, please cite us:

@article{yesiltepe2025dynamic,
  title={Dynamic View Synthesis as an Inverse Problem},
  author={Yesiltepe, Hidir and Yanardag, Pinar},
  journal={arXiv preprint arXiv:2506.08004},
  year={2025}
}
