Cross-Modal Quantization for Co-Speech Gesture Generation
Pages 10251–10263
Abstract
Learning proper representations of speech and gesture is essential for co-speech gesture generation. Existing approaches either use direct representations or encode speech and gesture independently, neglecting the joint representation that highlights the interplay between the two modalities. In this work, we propose a novel Cross-modal Quantization (CMQ) scheme that jointly learns quantized codes for speech and gesture. Such a representation captures the speech-gesture interaction before the complex mapping is learned, and thus better suits the intricate correspondence between speech and gesture. Specifically, a Cross-modal Quantizer jointly encodes speech and gesture into discrete codebooks, enabling better cross-modal interaction. A Cross-modal Predictor then uses the learned codebooks to autoregressively predict the next-step gesture. With cross-modal quantization, our approach achieves much higher codebook usage and generates more realistic and diverse gestures in practice. Extensive experiments on both 3D and 2D datasets, together with a subjective user study, demonstrate a clear performance gain over several baseline models in terms of audio-visual alignment and gesture diversity. In particular, our method achieves a three-fold improvement in diversity over baseline models while maintaining high motion fidelity.
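The abstract describes a two-stage recipe: a quantizer that maps fused speech and gesture features onto a shared discrete codebook, followed by an autoregressive predictor over gesture codes conditioned on speech. The snippet below is a minimal PyTorch sketch of that general recipe only, not the authors' implementation: the module names, feature dimensions, GRU backbone, straight-through estimator, and all hyperparameters are illustrative assumptions, and the usual VQ commitment/codebook losses are omitted for brevity.

```python
# Illustrative sketch of joint (cross-modal) quantization + autoregressive code
# prediction. All names, shapes, and hyperparameters are assumptions, not the
# paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalQuantizer(nn.Module):
    """Quantizes fused speech/gesture features against a shared discrete codebook."""

    def __init__(self, speech_dim=128, gesture_dim=128, code_dim=256, num_codes=512):
        super().__init__()
        self.fuse = nn.Linear(speech_dim + gesture_dim, code_dim)  # joint embedding
        self.codebook = nn.Embedding(num_codes, code_dim)          # discrete codes

    def forward(self, speech_feat, gesture_feat):
        # speech_feat, gesture_feat: (batch, time, dim), assumed frame-aligned.
        z = self.fuse(torch.cat([speech_feat, gesture_feat], dim=-1))
        # Nearest-neighbour lookup in the shared codebook.
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, K)
        codes = dist.argmin(dim=-1)                                     # (B, T)
        z_q = self.codebook(codes)
        # Straight-through estimator so gradients reach the fusion encoder.
        # (VQ commitment/codebook losses omitted for brevity.)
        z_q = z + (z_q - z).detach()
        return z_q, codes


class CrossModalPredictor(nn.Module):
    """Autoregressively predicts the next gesture code from speech + past codes."""

    def __init__(self, speech_dim=128, code_dim=256, num_codes=512, hidden=512):
        super().__init__()
        self.code_emb = nn.Embedding(num_codes, code_dim)
        self.rnn = nn.GRU(speech_dim + code_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_codes)

    def forward(self, speech_feat, prev_codes):
        # prev_codes: (B, T) codes for steps 0..T-1; output logits for steps 1..T.
        x = torch.cat([speech_feat, self.code_emb(prev_codes)], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)                                             # (B, T, K)


if __name__ == "__main__":
    B, T = 2, 30
    speech = torch.randn(B, T, 128)
    gesture = torch.randn(B, T, 128)

    quantizer = CrossModalQuantizer()
    z_q, codes = quantizer(speech, gesture)        # joint discrete representation

    predictor = CrossModalPredictor()
    logits = predictor(speech, codes)
    # Teacher-forced next-code prediction loss.
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, 512), codes[:, 1:].reshape(-1))
    print(z_q.shape, codes.shape, logits.shape, float(loss))
```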
Published In
IEEE Transactions on Multimedia, ISSN 1520-9210. © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Publisher
IEEE Press
Publication History
Published: 27 May 2024
Qualifiers
- Research-article