Research Article
DOI: 10.1145/3474085.3475415

Cross-modal Self-Supervised Learning for Lip Reading: When Contrastive Learning meets Adversarial Training

Published: 17 October 2021

Abstract

The goal of this work is to learn discriminative visual representations for lip reading without access to manual text annotations. Recent advances in cross-modal self-supervised learning have shown that the corresponding audio can serve as a supervisory signal for learning effective visual representations for lip reading. However, existing methods exploit only the natural synchronization between a video and its corresponding audio. We observe that both video and audio in fact carry speech-related information, identity-related information, and modality information. To make the visual representations (i) more discriminative for lip reading and (ii) indiscriminate with respect to identity and modality, we propose a novel self-supervised learning framework, Adversarial Dual-Contrast Self-Supervised Learning (ADC-SSL), which goes beyond previous methods by explicitly disentangling the visual representations from speech-unrelated information. Experimental results clearly show that the proposed method outperforms state-of-the-art cross-modal self-supervised baselines by a large margin. Moreover, ADC-SSL outperforms its supervised counterpart without any fine-tuning.
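
The full framework is not spelled out on this page, but the two generic ingredients the abstract names, cross-modal contrastive learning against the synchronized audio and adversarial training that removes modality cues, can be sketched compactly. Below is a minimal, hypothetical PyTorch sketch, not the authors' ADC-SSL implementation: the encoder interfaces, the 512-dimensional embedding, and the temperature of 0.07 are illustrative assumptions, and the gradient-reversal trick follows the standard domain-adversarial recipe.

```python
# Minimal sketch (NOT the authors' released code) of the two ingredients the
# abstract combines: (1) a cross-modal InfoNCE loss that treats each clip's
# synchronized audio as the positive and all other audio in the batch as
# negatives; (2) a modality discriminator trained through a gradient-reversal
# layer, so the encoders learn embeddings it cannot separate by modality.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


def cross_modal_nce(v, a, tau=0.07):
    """Symmetric InfoNCE between video embeddings v and audio embeddings a,
    both of shape (B, D); row i of v is synchronized with row i of a."""
    v = F.normalize(v, dim=1)
    a = F.normalize(a, dim=1)
    logits = v @ a.t() / tau                        # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric: video-to-audio and audio-to-video retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


class ModalityDiscriminator(nn.Module):
    """Classifies an embedding as video (0) or audio (1); gradient reversal
    makes the encoders maximize its confusion."""

    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2))

    def forward(self, z, lam=1.0):
        return self.net(GradReverse.apply(z, lam))


# One hypothetical training step (video_encoder / audio_encoder are stand-ins):
#   v = video_encoder(lip_clips)          # (B, 512)
#   a = audio_encoder(log_mels)           # (B, 512)
#   z = torch.cat([v, a], dim=0)
#   modality = torch.cat([torch.zeros(B), torch.ones(B)]).long()
#   loss = cross_modal_nce(v, a) + F.cross_entropy(disc(z), modality)
#   loss.backward()
```

Per the abstract, an analogous adversarial branch would target identity-related information; its exact formulation in ADC-SSL is not given on this page, so the sketch shows only the modality case.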

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021, 5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. adversarial training
2. cross-modal
3. lip reading
4. self-supervised learning

Funding Sources

• Natural Science Foundation of China
• Academy of Finland
• Outstanding Talents of "Ten Thousand Talents Plan"

Conference

MM '21: ACM Multimedia Conference
October 20-24, 2021
Virtual Event, China

Acceptance Rate

Overall acceptance rate: 2,145 of 8,556 submissions, 25%
