
Iterative Text-Based Editing of Talking-Heads Using Neural Retargeting

Published: 01 August 2021

Abstract

We present a text-based tool for editing talking-head video that enables an iterative editing workflow. On each iteration, users can edit the wording of the speech, further refine mouth motions if necessary to reduce artifacts, and manipulate non-verbal aspects of the performance by inserting mouth gestures (e.g., a smile) or changing the overall performance style (e.g., energetic, mumble). Our tool requires only 2 to 3 minutes of target actor video and synthesizes the video for each iteration in about 40 seconds, allowing users to quickly explore many editing possibilities as they iterate. Our approach is based on two key ideas. (1) We develop a fast phoneme search algorithm that can quickly identify the phoneme-level subsequences of the source repository video that best match a desired edit. This enables our fast iteration loop. (2) We leverage a large repository of video of a source actor and develop a new self-supervised neural retargeting technique for transferring the mouth motions of the source actor to the target actor. This allows us to work with relatively short target actor videos, making our approach applicable in many real-world editing scenarios. Finally, our refinement and performance controls give users the ability to further fine-tune the synthesized results.
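
To make the phoneme search concrete, the following is a minimal sketch (not the authors' implementation) of matching the phoneme sequence of an edited phrase against a phoneme-labeled source repository using Levenshtein distance. The repository format, function names, and scoring are hypothetical simplifications; the paper's fast search would replace this brute-force window scan.

```python
# Minimal sketch: brute-force phoneme-level subsequence matching.
# Assumes the source repository has been transcribed into a list of
# (phoneme, start_time, end_time) tuples by a forced aligner; all names
# and the scoring below are hypothetical, not the paper's algorithm.

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a[i-1]
                          d[i][j - 1] + 1,         # insert b[j-1]
                          d[i - 1][j - 1] + cost)  # substitute
    return d[m][n]

def best_matching_subsequences(repo, query_phonemes, k=5):
    """Return the k repository windows (as index ranges) whose phoneme
    strings are closest to the phoneme transcription of the edited text."""
    phones = [p for (p, _start, _end) in repo]
    w = len(query_phonemes)
    scored = []
    for start in range(len(phones) - w + 1):
        window = phones[start:start + w]
        scored.append((edit_distance(window, query_phonemes), start))
    scored.sort()
    return [(start, start + w) for _dist, start in scored[:k]]

# Example: candidate clips for an inserted word whose phonemes are HH AH L OW.
# candidates = best_matching_subsequences(repo_phonemes, ["HH", "AH", "L", "OW"])
```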

Supplementary Material

yao (yao.zip)
Supplemental movie, appendix, image, and software files for Iterative Text-Based Editing of Talking-Heads Using Neural Retargeting

Published In

ACM Transactions on Graphics, Volume 40, Issue 3
June 2021
264 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3463476
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 August 2021
Accepted: 01 February 2021
Revised: 01 January 2021
Received: 01 June 2020
Published in TOG Volume 40, Issue 3

Author Tags

  1. Text-based video editing
  2. talking-heads
  3. phonemes
  4. retargeting

Qualifiers

  • Research-article
  • Refereed
