ControlVideo: Training-free Controllable Text-to-Video Generation

Zhang, Yabo; Wei, Yuxiang; Jiang, Dongsheng; Zhang, Xiaopeng; Zuo, Wangmeng; Tian, Qi

Computer Science > Computer Vision and Pattern Recognition

arXiv:2305.13077 (cs)

[Submitted on 22 May 2023]


Abstract: Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts still lag behind due to the excessive training cost of temporal modeling. Besides the training burden, the generated videos also suffer from appearance inconsistency and structural flicker, especially in long video synthesis. To address these challenges, we design a training-free framework called ControlVideo to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages the coarse structural consistency of input motion sequences and introduces three modules to improve video generation. Firstly, to ensure appearance coherence between frames, ControlVideo adds full cross-frame interaction in self-attention modules. Secondly, to mitigate the flicker effect, it introduces an interleaved-frame smoother that applies frame interpolation to alternating frames. Finally, to produce long videos efficiently, it utilizes a hierarchical sampler that synthesizes each short clip separately while preserving holistic coherence. Empowered with these modules, ControlVideo outperforms state-of-the-art methods on extensive motion-prompt pairs, both quantitatively and qualitatively. Notably, thanks to the efficient designs, it generates both short and long videos within several minutes using one NVIDIA 2080Ti. Code is available at this https URL.
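The first module, cross-frame interaction in self-attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the tensor layout (frames, tokens, dim), and the single-head formulation are assumptions chosen for illustration. The idea shown is that each frame's queries attend to keys and values gathered from every frame, rather than only from its own frame.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q, k, v):
    """Single-head attention where every frame attends to all frames.

    q, k, v: arrays of shape (frames, tokens, dim). This layout is an
    assumption for the sketch, not taken from the paper's code.
    """
    f, n, d = q.shape
    # Pool keys and values from all frames into one shared set.
    k_all = k.reshape(f * n, d)
    v_all = v.reshape(f * n, d)
    # Scores of each query token against every token of every frame.
    scores = q @ k_all.T / np.sqrt(d)      # (frames, tokens, frames * tokens)
    weights = softmax(scores, axis=-1)
    return weights @ v_all                  # (frames, tokens, dim)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 6, 8))
    k = rng.normal(size=(4, 6, 8))
    v = rng.normal(size=(4, 6, 8))
    print(cross_frame_attention(q, k, v).shape)
```

Because the key and value sets are shared by all frames, every output token is a weighted average over the same pool of values, which is one plausible reading of why such a design would encourage consistent appearance across frames.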

Comments:	Code is available at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.13077 [cs.CV]
	(or arXiv:2305.13077v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.13077

Submission history

From: Yabo Zhang
[v1] Mon, 22 May 2023 14:48:53 UTC (25,502 KB)
