DOI: 10.1145/3583780.3614967 · CIKM Conference Proceedings · Research article

MGICL: Multi-Grained Interaction Contrastive Learning for Multimodal Named Entity Recognition

Published: 21 October 2023

Abstract

Multimodal Named Entity Recognition (MNER) combines data from different modalities (e.g., text, images, and videos) to recognize and classify named entities, a task crucial for constructing Multimodal Knowledge Graphs (MMKGs). However, existing research suffers from two prominent issues: over-reliance on textual features at the expense of visual features, and the lack of an effective way to reduce the feature-space discrepancy between modalities. To overcome these challenges, this paper proposes a Multi-Grained Interaction Contrastive Learning framework for the MNER task, namely MGICL. MGICL slices data into different granularities, i.e., sentence level and word-token level for text, and image level and object level for images. Using multimodal features at these granularities, the framework performs cross-granularity contrast, which narrows the feature-space discrepancy between modalities and helps the text acquire valuable visual features. Additionally, a visual gate mechanism is introduced to dynamically select relevant visual information, reducing the impact of visual noise. Experimental results demonstrate that MGICL satisfactorily tackles the challenges of MNER by enhancing the information interaction of multimodal data and reducing the effect of noise, and hence effectively improves MNER performance.
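The two ingredients the abstract names, multi-grained cross-modal contrast and a visual gate, can be sketched in a few lines. This is an illustrative NumPy sketch under assumptions, not the paper's implementation: the function names (`info_nce`, `multi_grained_loss`, `visual_gate`), the InfoNCE form of the contrastive loss, and the scalar sigmoid gate are choices made here for clarity; the paper's actual loss terms and gating details are in the full text.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each row to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def _logsumexp(x, axis=-1):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss: row i of `a` and row i of `b` form a positive
    pair; all other rows in the batch act as in-batch negatives."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = (a @ b.T) / temperature           # (batch, batch) cosine similarities
    pos = np.diag(logits)                      # similarity of each positive pair
    loss_a2b = (_logsumexp(logits, axis=1) - pos).mean()
    loss_b2a = (_logsumexp(logits.T, axis=1) - pos).mean()
    return 0.5 * (loss_a2b + loss_b2a)

def multi_grained_loss(sent_emb, img_emb, token_emb, obj_emb, temperature=0.07):
    """Sum of a coarse-grained (sentence vs. image) and a fine-grained
    (token vs. detected-object) contrastive term, mirroring the two
    text/image granularities described in the abstract."""
    return (info_nce(sent_emb, img_emb, temperature)
            + info_nce(token_emb, obj_emb, temperature))

def visual_gate(text_feat, vis_feat, w, b=0.0):
    """Per-sample scalar gate in (0, 1), computed from the concatenated
    text/visual features; it scales how much visual signal is mixed in,
    suppressing noisy images."""
    z = np.concatenate([text_feat, vis_feat], axis=-1) @ w + b
    g = 1.0 / (1.0 + np.exp(-z))               # sigmoid gate, shape (batch, 1)
    return text_feat + g * vis_feat
```

In this sketch, perfectly aligned text/image pairs drive the contrastive loss toward zero while mismatched pairs keep it high, which is the mechanism that pulls the two modalities' feature spaces together.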

Supplementary Material

MP4 File (full2187-video.mp4)
Presentation video for the CIKM 2023 full paper (submission 2187).



      Published In

      CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
      October 2023
      5508 pages
      ISBN:9798400701245
      DOI:10.1145/3583780
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. contrastive learning
      2. multi-grained interaction contrastive learning
      3. multimodal named entity recognition
      4. multimodal representation
      5. visual gate

      Qualifiers

      • Research-article

      Funding Sources

      • National Key R&D Program of China
      • NSFC

      Conference

      CIKM '23

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


