Research article
DOI: 10.1145/3590003.3590004

Virtual Human Talking-Head Generation

Published: 29 May 2023

Abstract

Virtual humans created with deep learning technology are now widely used in fields such as personal assistance, intelligent customer service, and online education. One of these applications is the human-computer interaction system, which integrates multi-modal technologies such as speech recognition, dialogue systems, speech synthesis, and virtual-human video synthesis. In this paper, we first design a framework for a human-computer interaction system built around a virtual human; next, we classify talking-head video synthesis models according to the depth of the generated virtual human; finally, we present a systematic review of technical developments in talking-head video generation over the last five years, highlighting seminal work.
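The abstract describes an interaction pipeline that chains speech recognition, a dialogue system, speech synthesis, and talking-head video synthesis. The sketch below shows how one turn of such a loop might be wired together; it is only an illustration of the data flow, not the authors' implementation, and every function (recognize_speech, dialogue_response, synthesize_speech, generate_talking_head) is a hypothetical placeholder.

```python
# Minimal sketch of the multi-modal virtual-human interaction loop described
# in the abstract: ASR -> dialogue -> TTS -> talking-head synthesis.
# All stages are hypothetical placeholders, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class Frame:
    """One rendered video frame of the virtual human (placeholder type)."""
    index: int
    lip_shape: str


def recognize_speech(audio: bytes) -> str:
    # Placeholder ASR stage: a real system would decode the waveform to text.
    return "hello, what can you do?"


def dialogue_response(user_text: str) -> str:
    # Placeholder dialogue system: map the user's utterance to a reply.
    return f"You said: '{user_text}'. I am a virtual assistant."


def synthesize_speech(text: str) -> bytes:
    # Placeholder TTS stage: a real system would return synthesized audio.
    return text.encode("utf-8")


def generate_talking_head(speech_audio: bytes) -> list[Frame]:
    # Placeholder talking-head generator: a real model would predict
    # lip-synced face frames driven by the synthesized speech.
    return [Frame(index=i, lip_shape="viseme") for i in range(len(speech_audio) // 8)]


def interact(user_audio: bytes) -> list[Frame]:
    """Run one turn of the virtual-human interaction loop."""
    text = recognize_speech(user_audio)
    reply = dialogue_response(text)
    speech = synthesize_speech(reply)
    return generate_talking_head(speech)


if __name__ == "__main__":
    frames = interact(b"\x00" * 1600)
    print(f"rendered {len(frames)} frames")
```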

    Information

    Published In

    CACML '23: Proceedings of the 2023 2nd Asia Conference on Algorithms, Computing and Machine Learning
    March 2023
    598 pages
    ISBN:9781450399449
    DOI:10.1145/3590003
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Multi-modal Human Computer Interaction
    2. Talking-head Generation
    3. Virtual Human

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • State Key Laboratory of Media Convergence Production Technology and Systems

    Conference

    CACML 2023

    Acceptance Rates

    CACML '23 Paper Acceptance Rate 93 of 241 submissions, 39%;
    Overall Acceptance Rate 93 of 241 submissions, 39%
