
Add VidTok AutoEncoders #11261


Open
wants to merge 4 commits into main

Conversation

annitang1997

We add VidTok, a versatile and state-of-the-art video tokenizer, as an autoencoder model to diffusers.

Paper: https://arxiv.org/pdf/2412.13061
Code: https://github.com/microsoft/VidTok
Model: https://huggingface.co/microsoft/VidTok
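
A rough usage sketch of what this could look like once merged (the API is still under review, so the AutoencoderVidTok import path, checkpoint layout, and encode/decode return types below are all assumptions):

```python
import torch
from diffusers import AutoencoderVidTok  # assumed import path, pending review

# Checkpoint id taken from the links above; exact file layout is an assumption.
vae = AutoencoderVidTok.from_pretrained("microsoft/VidTok").to("cuda")

# Dummy clip: (batch, channels, frames, height, width)
video = torch.randn(1, 3, 17, 256, 256, device="cuda")

with torch.no_grad():
    latents = vae.encode(video)  # exact return type depends on the final API
    recon = vae.decode(latents)
```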

@a-r-r-o-w (Member)

Thank you for the PR @annitang1997! I will review this in depth soon. cc @yiyixuxu too

@deeptimhe commented Apr 20, 2025

Are there any updates on the review process? 👀 Looking forward to using VidTok with diffusers.

@a-r-r-o-w (Member) left a comment

Thank you for the PR, and congratulations on the release of your awesome work!

I did a first-pass review of the changes needed to bring the implementation in line with the rest of the diffusers codebase. Some core implementation details will have to be refactored before we can merge. A good reference implementation for autoencoders can be found here:

I'd be happy to help with making some of these changes! 🤗

@@ -688,6 +689,158 @@ def get_codebook_entry(self, indices: torch.LongTensor, shape: Tuple[int, ...])
return z_q


class FSQRegularizer(nn.Module):

We're moving towards maintaining a single file per modeling implementation, and so let's move this to the vidtok autoencoder file

@@ -285,6 +285,27 @@ def forward(self, inputs: torch.Tensor) -> torch.Tensor:
return F.conv2d(inputs, weight, stride=2)


class VidTokDownsample2D(nn.Module):

Let's move this to the vidtok autoencoder file as well.

@@ -470,6 +471,28 @@ def forward(
return hidden_states, encoder_hidden_states, gate[:, None, :], enc_gate[:, None, :]


class VidTokLayerNorm(nn.Module):

Let's move this to the vidtok autoencoder file as well.

@@ -356,6 +356,26 @@ def forward(self, inputs: torch.Tensor) -> torch.Tensor:
return F.conv_transpose2d(inputs, weight, stride=2, padding=self.pad * 2 + 1)


class VidTokUpsample2D(nn.Module):

Let's move this to the vidtok autoencoder file as well.

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import pack, rearrange, unpack

We need to replace all einops operations with permute/reshape/other torch ops, since einops adds another dependency that we don't use in the codebase.
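
For example, the rearrange(x, "b c t h w -> (b t) c h w") pattern used below has a direct torch equivalent (a minimal sketch with arbitrary tensor sizes):

```python
import torch

x = torch.randn(2, 8, 4, 16, 16)  # (b, c, t, h, w)

# einops: rearrange(x, "b c t h w -> (b t) c h w")
b, c, t, h, w = x.shape
y = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)

# inverse, einops: rearrange(y, "(b t) c h w -> b c t h w", b=b)
x_back = y.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
assert torch.equal(x, x_back)
```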

Comment on lines +604 to +610

def create_custom_forward(module):
    def custom_forward(*inputs):
        return module.downsample(*inputs)

    return custom_forward


Suggested change
-    def create_custom_forward(module):
-        def custom_forward(*inputs):
-            return module.downsample(*inputs)
-        return custom_forward

if i_level in self.spatial_ds:
    # spatial downsample
    htmp = rearrange(hs[-1], "b c t h w -> (b t) c h w")
    htmp = torch.utils.checkpoint.checkpoint(create_custom_forward(self.down[i_level]), htmp)

Suggested change
-    htmp = torch.utils.checkpoint.checkpoint(create_custom_forward(self.down[i_level]), htmp)
+    htmp = self._gradient_checkpointing_func(self.down[i_level], htmp)
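
For context, diffusers models get _gradient_checkpointing_func from ModelMixin, which wraps torch.utils.checkpoint internally, so the custom_forward closure above becomes unnecessary. The usual pattern looks roughly like this (a sketch, not the exact VidTok code):

```python
# Inside forward(); the `hidden_states` naming is illustrative.
if torch.is_grad_enabled() and self.gradient_checkpointing:
    hidden_states = self._gradient_checkpointing_func(self.down[i_level], hidden_states)
else:
    hidden_states = self.down[i_level](hidden_states)
```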

B, _, T, H, W = htmp.shape
# middle
h = hs[-1]
h = torch.utils.checkpoint.checkpoint(self.mid.block_1, h, temb)

same comment as above for these usages

return h


class AutoencoderVidTok(ModelMixin, ConfigMixin, FromOriginalModelMixin):

Suggested change
-class AutoencoderVidTok(ModelMixin, ConfigMixin, FromOriginalModelMixin):
+class AutoencoderVidTok(ModelMixin, ConfigMixin):

self.tile_overlap_factor_width = 0.0 # 1 / 8

@staticmethod
def pad_at_dim(

Any methods that are not meant to be invoked directly by users should be made private, i.e. prefixed with an underscore (_pad_at_dim).
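
For reference, the privatized helper could look like the sketch below. This assumes pad_at_dim follows the common pattern of padding a single dimension via F.pad; the signature is illustrative, not taken from this PR:

```python
import torch
import torch.nn.functional as F

class AutoencoderVidTok:  # stub class for illustration only
    @staticmethod
    def _pad_at_dim(t: torch.Tensor, pad: tuple, dim: int = -1, value: float = 0.0) -> torch.Tensor:
        # F.pad takes (left, right) pairs starting from the last dimension,
        # so prepend zero pairs for every dimension to the right of `dim`.
        dims_from_right = (-dim - 1) if dim < 0 else (t.ndim - dim - 1)
        zeros = (0, 0) * dims_from_right
        return F.pad(t, (*zeros, *pad), value=value)

# e.g. pad 2 frames on each side of the temporal dim of a (b, c, t, h, w) tensor:
x = torch.randn(1, 3, 8, 16, 16)
y = AutoencoderVidTok._pad_at_dim(x, (2, 2), dim=2)
assert y.shape == (1, 3, 12, 16, 16)
```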
