Cross-Modal Quantization for Co-Speech Gesture Generation
Pages 10251–10263
Abstract
Learning proper representations of speech and gesture is essential for co-speech gesture generation. Existing approaches either use direct representations or encode speech and gesture independently, neglecting the joint representation that highlights the interplay between the two modalities. In this work, we propose a novel Cross-modal Quantization (CMQ) scheme that jointly learns quantized codes for speech and gesture. Such a representation captures the speech-gesture interaction before the complex mapping is learned, and thus better suits the intricate correspondence between speech and gesture. Specifically, a Cross-modal Quantizer jointly encodes speech and gesture into discrete codebooks, enabling better cross-modal interaction. A Cross-modal Predictor then uses the learned codebooks to autoregressively predict the next-step gesture. With cross-modal quantization, our approach achieves much higher codebook usage and generates more realistic and diverse gestures in practice. Extensive experiments on both 3D and 2D datasets, together with a subjective user study, demonstrate a clear performance gain over several baseline models in terms of audio-visual alignment and gesture diversity. In particular, our method achieves a three-fold improvement in diversity over baseline models while maintaining high motion fidelity.
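The abstract describes a two-stage recipe: a quantizer that maps fused speech and gesture features onto a shared discrete codebook, followed by an autoregressive predictor over gesture codes conditioned on speech. The snippet below is a minimal PyTorch sketch of that general recipe only, not the authors' implementation: the module names, feature dimensions, GRU backbone, straight-through estimator, and all hyperparameters are illustrative assumptions, and the usual VQ commitment/codebook losses are omitted for brevity.

```python
# Illustrative sketch of joint (cross-modal) quantization + autoregressive code
# prediction. All names, shapes, and hyperparameters are assumptions, not the
# paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalQuantizer(nn.Module):
    """Quantizes fused speech/gesture features against a shared discrete codebook."""

    def __init__(self, speech_dim=128, gesture_dim=128, code_dim=256, num_codes=512):
        super().__init__()
        self.fuse = nn.Linear(speech_dim + gesture_dim, code_dim)  # joint embedding
        self.codebook = nn.Embedding(num_codes, code_dim)          # discrete codes

    def forward(self, speech_feat, gesture_feat):
        # speech_feat, gesture_feat: (batch, time, dim), assumed frame-aligned.
        z = self.fuse(torch.cat([speech_feat, gesture_feat], dim=-1))
        # Nearest-neighbour lookup in the shared codebook.
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, K)
        codes = dist.argmin(dim=-1)                                     # (B, T)
        z_q = self.codebook(codes)
        # Straight-through estimator so gradients reach the fusion encoder.
        # (VQ commitment/codebook losses omitted for brevity.)
        z_q = z + (z_q - z).detach()
        return z_q, codes


class CrossModalPredictor(nn.Module):
    """Autoregressively predicts the next gesture code from speech + past codes."""

    def __init__(self, speech_dim=128, code_dim=256, num_codes=512, hidden=512):
        super().__init__()
        self.code_emb = nn.Embedding(num_codes, code_dim)
        self.rnn = nn.GRU(speech_dim + code_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_codes)

    def forward(self, speech_feat, prev_codes):
        # prev_codes: (B, T) codes for steps 0..T-1; output logits for steps 1..T.
        x = torch.cat([speech_feat, self.code_emb(prev_codes)], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)                                             # (B, T, K)


if __name__ == "__main__":
    B, T = 2, 30
    speech = torch.randn(B, T, 128)
    gesture = torch.randn(B, T, 128)

    quantizer = CrossModalQuantizer()
    z_q, codes = quantizer(speech, gesture)        # joint discrete representation

    predictor = CrossModalPredictor()
    logits = predictor(speech, codes)
    # Teacher-forced next-code prediction loss.
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, 512), codes[:, 1:].reshape(-1))
    print(z_q.shape, codes.shape, logits.shape, float(loss))
```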
Published In
IEEE Transactions on Multimedia, ISSN 1520-9210. © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Publisher
IEEE Press
Publication History
Published: 27 May 2024
Qualifiers
- Research-article