research-article

Open access

Face0: Instantaneously Conditioning a Text-to-Image Model on a Face

Authors:

Yaniv Leviathan

SA '23: SIGGRAPH Asia 2023 Conference Papers

Article No.: 94, Pages 1 - 10

https://doi.org/10.1145/3610548.3618249

Published: 11 December 2023


Abstract

We present Face0, a novel way to instantaneously condition a text-to-image generation model on a face without any optimization procedures such as fine-tuning or inversions. We augment a dataset of annotated images with embeddings of the included faces and train an image generation model on the augmented dataset. Once trained, our system is practically identical at inference time to the underlying base model, and is therefore able to generate face-conditioned images in just a couple of seconds. Our method achieves pleasing results, is remarkably simple, extremely fast, and equips the underlying model with new capabilities, like controlling the generated images both via text and via direct manipulation of the input face embeddings. In addition, when using a fixed random vector instead of a face embedding from a user-supplied image, our method essentially solves the problem of consistent character generation across images. Finally, our method decouples the model’s textual biases from its biases on faces. While requiring further research, we hope that this may help reduce biases in future text-to-image models.
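The two embedding-level controls mentioned in the abstract (reusing one fixed random vector for a consistent character, and directly manipulating face embeddings) can be illustrated with a minimal sketch. This is not the paper's implementation: the embedding size, the unit-norm convention, the linear interpolation, and the helper names `random_identity`, `interpolate`, and `build_conditioning` are all assumptions made for illustration, and the generator that would consume these inputs is left out.

```python
import math
import random

EMBED_DIM = 512  # assumed embedding size, not taken from the paper


def normalize(vec):
    """Scale a vector to unit length (an assumed convention)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def random_identity(seed, dim=EMBED_DIM):
    """Draw a reproducible random vector to stand in for a face embedding."""
    rng = random.Random(seed)
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def interpolate(a, b, t):
    """Blend two embeddings linearly and renormalize (one possible manipulation)."""
    return normalize([(1.0 - t) * x + t * y for x, y in zip(a, b)])


def build_conditioning(prompt, face_embedding):
    """Pair a prompt with a face embedding (hypothetical generator input)."""
    return {"prompt": prompt, "face": face_embedding}


# Consistent character: the same fixed vector is reused for every prompt.
character = random_identity(seed=7)
prompts = ["a portrait in a forest", "a pencil sketch", "an astronaut on the moon"]
batch = [build_conditioning(p, character) for p in prompts]

# Direct manipulation: move halfway toward a second identity.
other = random_identity(seed=11)
halfway = interpolate(character, other, 0.5)
```

Because the vector is derived from a seed, the same character can be recreated in a later session without storing any image.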



Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
  2. Computer graphics
    1. Rendering

Published In

SA '23: SIGGRAPH Asia 2023 Conference Papers

December 2023

1113 pages

ISBN: 9798400703157

DOI: 10.1145/3610548

Copyright © 2023 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

SIGGRAPH: ACM Special Interest Group on Computer Graphics and Interactive Techniques

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SA '23

Sponsor:

SIGGRAPH

SA '23: SIGGRAPH Asia 2023

December 12 - 15, 2023

Sydney, NSW, Australia

Acceptance Rates

Overall Acceptance Rate 178 of 869 submissions, 20%
