Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

Cui, Jiahao; Li, Hui; Zhan, Yun; Shang, Hanlin; Cheng, Kaihui; Ma, Yuqi; Mu, Shan; Zhou, Hang; Wang, Jingdong; Zhu, Siyu

arXiv:2412.00733 (cs)

[Submitted on 1 Dec 2024 (v1), last revised 4 Jan 2025 (this version, v3)]

Abstract: Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: this https URL.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Cite as: arXiv:2412.00733 [cs.CV] (or arXiv:2412.00733v3 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2412.00733

Submission history

From: Yun Zhan
[v1] Sun, 1 Dec 2024 08:54:30 UTC (31,867 KB)
[v2] Thu, 5 Dec 2024 02:55:56 UTC (21,834 KB)
[v3] Sat, 4 Jan 2025 06:49:09 UTC (21,834 KB)
