ConditionVideo: Training-Free Condition-Guided Video Generation
DOI:
https://doi.org/10.1609/aaai.v38i5.28244
Keywords:
CV: Computational Photography, Image & Video Synthesis; CV: Applications; CV: Language and Vision
Abstract
Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results, but at a high computational cost and with a requirement for large amounts of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on a provided condition, video, and input text, leveraging the power of off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or from given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). Our 3D control network extends the conventional 2D ControlNet model, strengthening conditional generation accuracy by additionally leveraging bi-directional frames in the temporal domain. Our method outperforms the compared methods in frame consistency, CLIP score, and conditional accuracy.
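The sBiST-Attn mechanism described in the abstract can be pictured as each frame's tokens attending to keys and values gathered from a sparse, bidirectional sample of the clip's frames. The following minimal PyTorch sketch illustrates that reading only; the sampling stride, the inclusion of the current frame, and the tensor layout are assumptions, since the abstract does not specify them.

```python
import torch


def sbist_attn(x: torch.Tensor, stride: int = 3) -> torch.Tensor:
    """Sketch of sparse bi-directional spatial-temporal attention.

    x: latent features of shape (batch, frames, tokens, dim).
    Each frame's queries attend to keys/values gathered from frames
    sampled bidirectionally at a fixed stride, plus the frame itself.
    """
    b, f, n, d = x.shape
    out = torch.empty_like(x)
    for t in range(f):
        # Hypothetical sampling rule: every `stride`-th frame across the
        # whole clip (i.e., both before and after t), always including t.
        idx = sorted(set(range(0, f, stride)) | {t})
        kv = x[:, idx].reshape(b, len(idx) * n, d)  # sampled frames as one token axis
        q = x[:, t]                                 # (b, n, d)
        attn = torch.softmax(q @ kv.transpose(1, 2) / d ** 0.5, dim=-1)
        out[:, t] = attn @ kv                       # (b, n, d)
    return out
```

In the paper's architecture this kind of attention would sit inside the inflated UNet in place of plain per-frame self-attention; the sketch is only meant to convey the sparse bidirectional key/value construction, not the authors' exact implementation.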
Published
2024-03-24
How to Cite
Peng, B., Chen, X., Wang, Y., Lu, C., & Qiao, Y. (2024). ConditionVideo: Training-Free Condition-Guided Video Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(5), 4459-4467. https://doi.org/10.1609/aaai.v38i5.28244
Section
AAAI Technical Track on Computer Vision IV