
UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers


1THU, 2ShengShu, 3UT-Austin, 4RUC, 5Princeton

arXiv    Code

Overview


Motivation: Despite recent advances, video diffusion transformers still struggle to generalize beyond their training length. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs.

Analysis: We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This dispersion directly causes quality degradation; repetition emerges as a special case when the dispersion becomes structured into periodic attention patterns, induced by harmonic properties of positional encodings.
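The dilution effect can be seen in a toy example (all numbers below are made up for illustration and are not from the paper): a softmax that placed most of its mass on one in-window token loses part of that mass as soon as extra, out-of-window tokens with comparable logits are appended.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy logits: a query attends sharply to the first in-window token.
train_logits = np.array([4.0, 1.0, 1.0, 1.0])
peak_short = softmax(train_logits)[0]

# Extrapolation appends extra tokens with similar logits; the same
# softmax now spreads probability mass over them, diluting the
# learned attention peak.
extra_logits = np.full(4, 1.0)
peak_long = softmax(np.concatenate([train_logits, extra_logits]))[0]

assert peak_long < peak_short  # attention on the learned token is diluted
```

In this toy run the peak weight drops from roughly 0.87 to roughly 0.74, which is the "dispersion" the analysis refers to: the same learned pattern, spread thinner over a longer context.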

Method: Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention to tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, it outperforms a broad set of baselines by a large margin across models and extrapolation ratios, pushing the extrapolation limit from 2× to 4×.
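The core idea can be sketched in a few lines. The snippet below is a minimal, hedged illustration, not the paper's implementation: the function name `decayed_attention`, the post-softmax placement of the decay, and the sample value `decay=0.5` are all assumptions chosen for clarity; only "scale attention on out-of-window tokens by a constant factor" comes from the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decayed_attention(q, k, v, train_len, decay=0.5):
    """Scaled dot-product attention where weights on keys beyond the
    training window (index >= train_len) are scaled by a constant
    decay factor, then renormalized.

    Illustrative sketch only: the paper's actual decay value and the
    exact point where it is applied may differ.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (Lq, Lk) attention logits
    weights = softmax(scores, axis=-1)
    # Suppress attention mass on out-of-window tokens...
    mask = np.ones(k.shape[0])
    mask[train_len:] = decay
    weights = weights * mask
    # ...and renormalize so each query's weights still sum to 1.
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the modification is a constant elementwise scaling of attention weights, it needs no retraining and can be dropped into any attention layer, which matches the "training-free, plug-and-play" claim.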



Training-free 3× Extrapolation




Training-free 4× Extrapolation






3× Extrapolation of Wan-VACE for Downstream Tasks






BibTeX


        @article{zhao2025ultravico,
          title={UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers},
          author={Zhao, Min and Zhu, Hongzhou and Wang, Yingze and Yan, Bokai and Zhang, Jintao and He, Guande and Yang, Ling and Li, Chongxuan and Zhu, Jun},
          journal={arXiv preprint arXiv:2511.20123},
          year={2025}
        }