Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

Qi, Xingqun; Pan, Jiahao; Li, Peng; Yuan, Ruibin; Chi, Xiaowei; Li, Mengfei; Luo, Wenhan; Xue, Wei; Zhang, Shanghang; Liu, Qifeng; Guo, Yike

arXiv:2311.17532 (cs)

[Submitted on 29 Nov 2023 (v1), last revised 27 Mar 2024 (this version, v3)]

Abstract: Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While existing methods can generate gestures that follow a single emotion label, they overlook that modeling long gesture sequences with emotion transitions is more practical in real scenes. In addition, the lack of large-scale datasets pairing emotion-transition speech with corresponding 3D human gestures hinders progress on this task. To address this, we first combine ChatGPT-4 with an audio inpainting approach to construct high-fidelity emotion-transition human speech. Because obtaining realistic 3D pose annotations for the dynamically inpainted emotion-transition audio is extremely difficult, we propose a novel weakly supervised training strategy to encourage authentic gesture transitions. Specifically, to improve the coordination of the transition gestures with the gestures of the two different emotions, we model the temporal association representation between the two emotional gesture sequences as style guidance and infuse it into the transition generation. We further devise an emotion mixture mechanism that provides weak supervision for the transition gestures based on a learnable mixed emotion label. Finally, we present a keyframe sampler that supplies effective initial posture cues in long sequences, enabling us to generate diverse gestures. Extensive experiments demonstrate that our method outperforms state-of-the-art models constructed by adapting single-emotion-conditioned counterparts on our newly defined emotion transition task and datasets. Our code and dataset will be released on the project page: this https URL.
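The abstract does not specify how the learnable mixed emotion label is formed. As a rough illustration only, and not the authors' implementation, one plausible reading is a convex combination of the two one-hot emotion labels whose mixing weight comes from a trainable scalar. The function names, the sigmoid parameterization, and the emotion count below are all assumptions made for this sketch.

```python
import math

def one_hot(index, num_classes):
    """Return a one-hot list with 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def mixed_emotion_label(src, dst, logit):
    """Blend two emotion labels into one soft label.

    `logit` stands in for a learnable scalar (an assumption of this sketch);
    a sigmoid maps it to a mixing weight alpha in (0, 1), so the result is
    always a valid probability vector over emotions.
    """
    alpha = 1.0 / (1.0 + math.exp(-logit))
    return [(1.0 - alpha) * s + alpha * d for s, d in zip(src, dst)]

NUM_EMOTIONS = 4  # hypothetical number of emotion classes
source_emotion = one_hot(0, NUM_EMOTIONS)  # emotion before the transition
target_emotion = one_hot(2, NUM_EMOTIONS)  # emotion after the transition

# With logit = 0 the two emotions are weighted equally.
label = mixed_emotion_label(source_emotion, target_emotion, logit=0.0)
print(label)
```

In a real training setup the scalar would be optimized jointly with the generator, and the soft label would serve as the weak supervision target for the transition segment, for which no ground-truth pose annotations exist.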

Comments:	Accepted by CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.17532 [cs.CV]
	(or arXiv:2311.17532v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.17532

Submission history

From: Xingqun Qi
[v1] Wed, 29 Nov 2023 11:10:40 UTC (2,058 KB)
[v2] Sun, 24 Dec 2023 07:37:10 UTC (2,060 KB)
[v3] Wed, 27 Mar 2024 15:01:22 UTC (4,598 KB)
