
Iterative Text-Based Editing of Talking-Heads Using Neural Retargeting

Published: 01 August 2021

Abstract

We present a text-based tool for editing talking-head video that enables an iterative editing workflow. On each iteration, users can edit the wording of the speech, further refine mouth motions if necessary to reduce artifacts, and manipulate non-verbal aspects of the performance by inserting mouth gestures (e.g., a smile) or changing the overall performance style (e.g., energetic, mumble). Our tool requires only 2 to 3 minutes of target actor video and synthesizes the video for each iteration in about 40 seconds, allowing users to quickly explore many editing possibilities as they iterate. Our approach is based on two key ideas. (1) We develop a fast phoneme search algorithm that can quickly identify the phoneme-level subsequences of the source repository video that best match a desired edit. This enables our fast iteration loop. (2) We leverage a large repository of video of a source actor and develop a new self-supervised neural retargeting technique for transferring the mouth motions of the source actor to the target actor. This allows us to work with relatively short target actor videos, making our approach applicable in many real-world editing scenarios. Finally, our refinement and performance controls give users the ability to further fine-tune the synthesized results.
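
To make the phoneme search concrete, the following is a minimal sketch (not the authors' implementation) of matching the phoneme sequence of an edited phrase against a phoneme-labeled source repository using Levenshtein distance. The repository format, function names, and scoring are hypothetical simplifications; the paper's fast search would replace this brute-force window scan.

```python
# Minimal sketch: brute-force phoneme-level subsequence matching.
# Assumes the source repository has been transcribed into a list of
# (phoneme, start_time, end_time) tuples by a forced aligner; all names
# and the scoring below are hypothetical, not the paper's algorithm.

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a[i-1]
                          d[i][j - 1] + 1,         # insert b[j-1]
                          d[i - 1][j - 1] + cost)  # substitute
    return d[m][n]

def best_matching_subsequences(repo, query_phonemes, k=5):
    """Return the k repository windows (as index ranges) whose phoneme
    strings are closest to the phoneme transcription of the edited text."""
    phones = [p for (p, _start, _end) in repo]
    w = len(query_phonemes)
    scored = []
    for start in range(len(phones) - w + 1):
        window = phones[start:start + w]
        scored.append((edit_distance(window, query_phonemes), start))
    scored.sort()
    return [(start, start + w) for _dist, start in scored[:k]]

# Example: candidate clips for an inserted word whose phonemes are HH AH L OW.
# candidates = best_matching_subsequences(repo_phonemes, ["HH", "AH", "L", "OW"])
```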

Supplementary Material

yao (yao.zip)
Supplemental movie, appendix, image, and software files for Iterative Text-Based Editing of Talking-Heads Using Neural Retargeting

Published In

ACM Transactions on Graphics, Volume 40, Issue 3
June 2021
264 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3463476
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 August 2021
Accepted: 01 February 2021
Revised: 01 January 2021
Received: 01 June 2020
Published in TOG Volume 40, Issue 3

Author Tags

  1. Text-based video editing
  2. talking-heads
  3. phonemes
  4. retargeting

Qualifiers

  • Research-article
  • Refereed
