DOI: 10.1145/3474085.3475637

Dense Contrastive Visual-Linguistic Pretraining

Published: 17 October 2021

Abstract

Inspired by the success of BERT, several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. In particular, LXMERT and UNITER adopt visual region feature regression and label classification as pretext tasks. However, they tend to suffer from noisy labels and sparse semantic annotations, because the visual features they rely on were pretrained on a crowdsourced dataset with limited and inconsistent semantic labeling. To overcome these issues, we propose unbiased Dense Contrastive Visual-Linguistic Pretraining (DCVLP), which replaces the region regression and classification with cross-modality region contrastive learning that requires no annotations. Two data augmentation strategies (Mask Perturbation and Intra-Inter-Adversarial Perturbation) are developed to improve the quality of the negative samples used in contrastive learning. Overall, DCVLP enables cross-modality dense region contrastive learning in a self-supervised setting, independent of any object annotations. We compare our method against prior visual-linguistic pretraining frameworks to validate the superiority of dense contrastive learning for multimodal representation learning.
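The abstract's core mechanism is an InfoNCE-style contrastive loss applied densely at the region level across modalities, with augmented negatives. As a rough illustration only, not the authors' implementation, the sketch below contrasts each region embedding against a mask-perturbed positive and a queue of negatives; every name here (`region_info_nce`, `mask_perturb`, `adversarial_negatives`, `queue`, `temperature`, `eps`) is a hypothetical placeholder, and the perturbation mechanics are guesses in the spirit of MoCo [16], adversarial examples [12], and AdCo [17].

```python
import torch
import torch.nn.functional as F

def mask_perturb(region_feats: torch.Tensor, mask_prob: float = 0.15) -> torch.Tensor:
    """Hypothetical Mask Perturbation: randomly zero out whole region
    feature vectors before the key-encoder forward pass."""
    keep = (torch.rand(region_feats.size(0), 1,
                       device=region_feats.device) > mask_prob).float()
    return region_feats * keep

def region_info_nce(query: torch.Tensor, key: torch.Tensor,
                    queue: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over region embeddings.

    query: (N, D) regions from the online cross-modality encoder.
    key:   (N, D) the same regions from a perturbed forward pass (positives).
    queue: (K, D) embeddings of unrelated regions (negatives).
    """
    query, key, queue = (F.normalize(t, dim=1) for t in (query, key, queue))
    l_pos = (query * key).sum(dim=1, keepdim=True)  # (N, 1) positive logits
    l_neg = query @ queue.t()                       # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive key always sits at index 0 of each row's logits.
    labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, labels)

def adversarial_negatives(queue: torch.Tensor, loss: torch.Tensor,
                          eps: float = 1e-2) -> torch.Tensor:
    """Hypothetical adversarial variant: nudge queued negatives in the
    direction that increases the contrastive loss (FGSM-style, cf. [12]).
    `queue` must have requires_grad=True when `loss` was computed."""
    grad, = torch.autograd.grad(loss, queue)
    return (queue + eps * grad.sign()).detach()
```

In a full pipeline, `key` and `queue` would plausibly come from a momentum encoder and a feature queue as in MoCo [16]; the paper's Intra-Inter-Adversarial Perturbation presumably constructs harder negatives within and across modalities, which the FGSM-style helper above only gestures at.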

References

[1]
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6077--6086.
[2]
Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus. 2020. Learning Visual Representations with Caption Annotations. arXiv e-prints (2020).
[3]
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020).
[4]
Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. 2020. Improved Baselines with Momentum Contrastive Learning. arXiv preprint arXiv:2003.04297 (2020).
[5]
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Uniter: Learning universal image-text representations. arXiv preprint arXiv:1909.11740 (2019).
[6]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248--255.
[7]
Karan Desai and Justin Johnson. 2020. VirTex: Learning Visual Representations from Textual Annotations. arXiv preprint arXiv:2006.06666 (2020).
[8]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[9]
Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-Scale Adversarial Training for Vision-and-Language Representation Learning. arXiv preprint arXiv:2006.06195 (2020).
[10]
Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi, Xiaogang Wang, and Hongsheng Li. 2019. Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6639--6648.
[11]
Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, and Anoop Cherian. 2021. Dynamic graph representation learning for video dialog via multi-modal shuffled transformers. In AAAI Conference on Artificial Intelligence.
[12]
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
[13]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR).
[14]
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733 (2020).
[15]
Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2. IEEE, 1735--1742.
[16]
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2019. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 (2019).
[17]
Qianjiang Hu, Xiao Wang, Wei Hu, and Guo-Jun Qi. 2020. AdCo: Adversarial Contrast for Efficient Learning of Unsupervised Representations from Self-Trained Negative Adversaries. arXiv preprint arXiv:2011.08435 (2020).
[18]
Drew A Hudson and Christopher D Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
[19]
Ziyu Jiang, Tianlong Chen, Ting Chen, and Zhangyang Wang. 2020. Robust pre-training by adversarial contrastive learning. arXiv preprint arXiv:2010.13337 (2020).
[20]
Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. 2019. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 1920--1929.
[21]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, 1 (2017), 32--73.
[22]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097--1105.
[23]
Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV). 201--216.
[24]
Linjie Li, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Relation-aware graph attention network for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 10313--10322.
[25]
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019).
[26]
Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. arXiv preprint arXiv:2004.06165 (2020).
[27]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740--755.
[28]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[29]
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems. 13--23.
[30]
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations. https://openreview.net/forum?id=rJzIBfZAb
[31]
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. 2017. Universal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1765--1773.
[32]
Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. 2016. Modeling context between objects for referring expression understanding. In European Conference on Computer Vision. Springer, 792--807.
[33]
Mehdi Noroozi and Paolo Favaro. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision. Springer, 69--84.
[34]
Vicente Ordonez, Girish Kulkarni, and Tamara Berg. 2011. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems 24 (2011), 1143--1151.
[35]
Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2536--2544.
[36]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021).
[37]
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018).
[38]
Guanghui Ren, Lejian Ren, Yue Liao, Si Liu, Bo Li, Jizhong Han, and Shuicheng Yan. 2020. Scene graph generation with hierarchical context. IEEE Transactions on Neural Networks and Learning Systems 32, 2 (2020), 909--915.
[39]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91--99.
[40]
Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. 2019. Adversarial training for free! arXiv preprint arXiv:1904.12843 (2019).
[41]
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2556--2565.
[42]
Lei Shi, Shijie Geng, Kai Shuang, Chiori Hori, Songxiang Liu, Peng Gao, and Sen Su. 2020. Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4412--4416.
[43]
Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. 2017. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 217--223.
[44]
Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2018. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491 (2018).
[45]
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
[46]
Hao Tan and Mohit Bansal. 2019. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019).
[47]
Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019).
[48]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998--6008.
[49]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[50]
Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3733--3742.
[51]
Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2019. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706 (2019).
[52]
Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6281--6290.
[53]
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6720--6731.
[54]
Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced Adversarial Training for Natural Language Understanding. In ICLR. https://openreview.net/forum?id=BygzbyHFvB

    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021, 5796 pages
    ISBN: 9781450386517
    DOI: 10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. adversarial learning
    2. contrastive learning
    3. multimodal applications
    4. visual-linguistic pretraining

    Qualifiers

    • Research-article

    Funding Sources

    • Project supported by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China

    Conference

    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
