Research Article
DOI: 10.1145/3474085.3475415

Cross-modal Self-Supervised Learning for Lip Reading: When Contrastive Learning meets Adversarial Training

Published: 17 October 2021

Abstract

The goal of this work is to learn discriminative visual representations for lip reading without access to manual text annotations. Recent advances in cross-modal self-supervised learning have shown that the corresponding audio can serve as a supervisory signal for learning effective visual representations for lip reading. However, existing methods exploit only the natural synchronization between a video and its corresponding audio. We observe that both video and audio in fact carry speech-related information, identity-related information, and modality information. To make the visual representations (i) more discriminative for lip reading and (ii) indiscriminate with respect to identity and modality, we propose a novel self-supervised learning framework, Adversarial Dual-Contrast Self-Supervised Learning (ADC-SSL), which goes beyond previous methods by explicitly disentangling the visual representations from speech-unrelated information. Experimental results clearly show that the proposed method outperforms state-of-the-art cross-modal self-supervised baselines by a large margin. Moreover, ADC-SSL outperforms its supervised counterpart without any fine-tuning.
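
The full framework is not spelled out on this page, but the two generic ingredients the abstract names, cross-modal contrastive learning against the synchronized audio and adversarial training that removes modality cues, can be sketched compactly. Below is a minimal, hypothetical PyTorch sketch, not the authors' ADC-SSL implementation: the encoder interfaces, the 512-dimensional embedding, and the temperature of 0.07 are illustrative assumptions, and the gradient-reversal trick follows the standard domain-adversarial recipe.

```python
# Minimal sketch (NOT the authors' released code) of the two ingredients the
# abstract combines: (1) a cross-modal InfoNCE loss that treats each clip's
# synchronized audio as the positive and all other audio in the batch as
# negatives; (2) a modality discriminator trained through a gradient-reversal
# layer, so the encoders learn embeddings it cannot separate by modality.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


def cross_modal_nce(v, a, tau=0.07):
    """Symmetric InfoNCE between video embeddings v and audio embeddings a,
    both of shape (B, D); row i of v is synchronized with row i of a."""
    v = F.normalize(v, dim=1)
    a = F.normalize(a, dim=1)
    logits = v @ a.t() / tau                        # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric: video-to-audio and audio-to-video retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


class ModalityDiscriminator(nn.Module):
    """Classifies an embedding as video (0) or audio (1); gradient reversal
    makes the encoders maximize its confusion."""

    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2))

    def forward(self, z, lam=1.0):
        return self.net(GradReverse.apply(z, lam))


# One hypothetical training step (video_encoder / audio_encoder are stand-ins):
#   v = video_encoder(lip_clips)          # (B, 512)
#   a = audio_encoder(log_mels)           # (B, 512)
#   z = torch.cat([v, a], dim=0)
#   modality = torch.cat([torch.zeros(B), torch.ones(B)]).long()
#   loss = cross_modal_nce(v, a) + F.cross_entropy(disc(z), modality)
#   loss.backward()
```

Per the abstract, an analogous adversarial branch would target identity-related information; its exact formulation in ADC-SSL is not given on this page, so the sketch shows only the modality case.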

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021, 5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. adversarial training
2. cross-modal
3. lip reading
4. self-supervised learning

Funding Sources

• Natural Science Foundation of China
• Academy of Finland
• Outstanding Talents of "Ten Thousand Talents Plan"

Conference

MM '21: ACM Multimedia Conference
October 20-24, 2021
Virtual Event, China

Acceptance Rate

Overall acceptance rate: 2,145 of 8,556 submissions, 25%
