Research article
DOI: 10.1145/3590003.3590004

Virtual Human Talking-Head Generation

Published: 29 May 2023

Abstract

Virtual humans created with deep learning technology are now widely used in fields such as personal assistance, intelligent customer service, and online education. One of these applications is the human-computer interaction system, which integrates multi-modal technologies such as speech recognition, dialogue systems, speech synthesis, and virtual-human video synthesis. In this paper, we first design a framework for a human-computer interaction system built around a virtual human; next, we classify talking-head video synthesis models according to the depth of the generated virtual human; finally, we present a systematic review of technical developments in talking-head video generation over the last five years, highlighting seminal work.
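The abstract describes an interaction pipeline that chains speech recognition, a dialogue system, speech synthesis, and talking-head video synthesis. The sketch below shows how one turn of such a loop might be wired together; it is only an illustration of the data flow, not the authors' implementation, and every function (recognize_speech, dialogue_response, synthesize_speech, generate_talking_head) is a hypothetical placeholder.

```python
# Minimal sketch of the multi-modal virtual-human interaction loop described
# in the abstract: ASR -> dialogue -> TTS -> talking-head synthesis.
# All stages are hypothetical placeholders, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class Frame:
    """One rendered video frame of the virtual human (placeholder type)."""
    index: int
    lip_shape: str


def recognize_speech(audio: bytes) -> str:
    # Placeholder ASR stage: a real system would decode the waveform to text.
    return "hello, what can you do?"


def dialogue_response(user_text: str) -> str:
    # Placeholder dialogue system: map the user's utterance to a reply.
    return f"You said: '{user_text}'. I am a virtual assistant."


def synthesize_speech(text: str) -> bytes:
    # Placeholder TTS stage: a real system would return synthesized audio.
    return text.encode("utf-8")


def generate_talking_head(speech_audio: bytes) -> list[Frame]:
    # Placeholder talking-head generator: a real model would predict
    # lip-synced face frames driven by the synthesized speech.
    return [Frame(index=i, lip_shape="viseme") for i in range(len(speech_audio) // 8)]


def interact(user_audio: bytes) -> list[Frame]:
    """Run one turn of the virtual-human interaction loop."""
    text = recognize_speech(user_audio)
    reply = dialogue_response(text)
    speech = synthesize_speech(reply)
    return generate_talking_head(speech)


if __name__ == "__main__":
    frames = interact(b"\x00" * 1600)
    print(f"rendered {len(frames)} frames")
```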

    Information

    Published In

    CACML '23: Proceedings of the 2023 2nd Asia Conference on Algorithms, Computing and Machine Learning
    March 2023
    598 pages
    ISBN:9781450399449
    DOI:10.1145/3590003
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Multi-modal Human Computer Interaction
    2. Talking-head Generation
    3. Virtual Human

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • State Key Laboratory of Media Convergence Production Technology and Systems

    Conference

    CACML 2023

    Acceptance Rates

    CACML '23 Paper Acceptance Rate 93 of 241 submissions, 39%;
    Overall Acceptance Rate 93 of 241 submissions, 39%
