DOI: 10.1145/3503161.3547943
Research article
Open access

GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement

Published: 10 October 2022

Abstract

Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like" event understanding. Specifically, the GSR task not only detects the salient activity verb (e.g., buying) but also predicts all corresponding semantic roles (e.g., agent and goods). Inspired by object detection and image captioning, existing methods typically employ a two-stage framework: 1) detect the activity verb, and then 2) predict semantic roles conditioned on the detected verb. This framework constitutes a significant obstacle to semantic understanding. First, pre-detecting the verb alone, without the semantic roles, inevitably fails to distinguish many similar daily activities (e.g., offering vs. giving, buying vs. selling). Second, predicting semantic roles in a closed auto-regressive manner can hardly exploit the semantic relations between the verb and the roles. To this end, in this paper we propose a novel two-stage framework that focuses on exploiting such bidirectional relations between verbs and roles. In the first stage, instead of pre-detecting the verb, we postpone the detection step and assume a pseudo label, learning an intermediate representation for each corresponding semantic role from the image. In the second stage, we exploit transformer layers to uncover the potential semantic relations among both verbs and semantic roles. With the help of a set of support images, an alternate learning scheme is designed to simultaneously optimize the results: the verb is updated using the nouns corresponding to the image, and the nouns are updated using verbs from the support images. Extensive experimental results on the challenging SWiG benchmark show that our renovated framework outperforms other state-of-the-art methods under various metrics.
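The alternate refinement scheme sketched in the abstract can be illustrated with a minimal, purely hypothetical toy: the verb estimate is pooled from this image's role/noun embeddings, while each noun estimate is pooled from the verb embeddings of support images, and the two updates alternate. All names, dimensions, and the simple dot-product attention below are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch (NOT the paper's implementation) of the alternate
# verb/noun refinement loop described in the abstract, using plain
# single-head dot-product attention over toy embeddings.
import numpy as np

rng = np.random.default_rng(0)
D = 8                                      # embedding dimension (illustrative)
verb = rng.normal(size=D)                  # initial pseudo-label verb embedding
nouns = rng.normal(size=(6, D))            # role/noun embeddings for this image
support_verbs = rng.normal(size=(4, D))    # verb embeddings of support images

def attend(query, keys):
    """Pool `keys` with softmax dot-product weights against `query`."""
    scores = keys @ query / np.sqrt(len(query))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys

for _ in range(3):  # a few alternate refinement steps
    # 1) update the verb using the nouns corresponding to the image
    verb = 0.5 * verb + 0.5 * attend(verb, nouns)
    # 2) update each noun using the verbs from the support images
    nouns = 0.5 * nouns + 0.5 * np.stack([attend(n, support_verbs) for n in nouns])

print(verb.shape, nouns.shape)  # shapes preserved: (8,) and (6, 8)
```

The residual-style mixing (`0.5 * old + 0.5 * attended`) is only one plausible way to keep both updates stable while the two sets of estimates condition on each other; the paper's transformer layers realize this bidirectional exchange with learned parameters.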

Supplementary Material

MP4 File (MM22-fp0801.mp4)
Video presentation of GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement



Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022, 7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. grounded situation recognition
    2. transformer framework


    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

