
Towards Open-Vocabulary Scene Graph Generation with Prompt-Based Finetuning

Published: 23 October 2022

Abstract

Scene graph generation (SGG) is a fundamental task that aims to detect visual relations between objects in an image. Prevailing SGG methods require all object classes to be given in the training set; such a closed setting limits the practical application of SGG. In this paper, we introduce open-vocabulary scene graph generation (Ov-SGG), a novel, realistic, and challenging setting in which a model is trained on a set of base object classes but is required to infer relations for unseen target object classes. To this end, we propose a two-step method that first pre-trains on large amounts of coarse-grained region-caption data and then leverages two prompt-based techniques to finetune the pre-trained model without updating its parameters. Moreover, our method supports inference over completely unseen object classes, which existing methods are incapable of handling. In extensive experiments on three benchmark datasets (Visual Genome, GQA, and Open Images), our method significantly outperforms recent strong SGG methods in the Ov-SGG setting as well as in conventional closed SGG.
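
To make the prompt idea concrete, below is a minimal, text-only sketch of prompt-based predicate scoring with a frozen masked language model: candidate predicates for a (subject, object) pair are ranked by the model's score at a mask slot, with no parameter updates. This is an illustrative simplification, not the paper's method, which additionally conditions on visual region features; the model name, prompt template, and predicate list here are all assumptions.

```python
# A minimal, text-only sketch of prompt-based predicate scoring with a frozen
# masked language model. Illustrative only: the paper's approach also uses
# visual region features; the model, template, and predicates are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()  # kept frozen: no parameter updates, mirroring prompt-based finetuning

# Hypothetical candidate predicates; a real SGG vocabulary is much larger.
PREDICATES = ["on", "under", "near", "holding", "riding", "wearing"]

def rank_predicates(subject: str, obj: str):
    """Rank candidate predicates by the MLM logit at the [MASK] slot."""
    prompt = f"the {subject} is {tokenizer.mask_token} the {obj}."
    inputs = tokenizer(prompt, return_tensors="pt")
    # Locate the [MASK] position in the tokenized sequence.
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    scores = {}
    for pred in PREDICATES:
        ids = tokenizer.encode(pred, add_special_tokens=False)
        if len(ids) == 1:  # keep predicates that span a single wordpiece
            scores[pred] = logits[ids[0]].item()
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_predicates("man", "horse"))  # e.g. "riding" should rank highly
```

In the paper's setting, the prompts are instead paired with region features from the pre-trained visual-language model, which is what lets the same frozen model score relations for object classes never seen during relation training.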


Cited By

  • (2024) Knowledge Distillation for Single Image Super-Resolution via Contrastive Learning. In: Proceedings of the 2024 International Conference on Multimedia Retrieval, pp. 1079–1083. DOI: 10.1145/3652583.3657606. Online publication date: 30-May-2024.
  • (2023) Zero-shot visual relation detection via composite visual cues from large language models. In: Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 50105–50116. DOI: 10.5555/3666122.3668301. Online publication date: 10-Dec-2023.
  • (2023) Open visual knowledge extraction via relation-oriented multimodality model prompting. In: Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 23499–23519. DOI: 10.5555/3666122.3667141. Online publication date: 10-Dec-2023.
  • (2023) Open-Vocabulary Object Detection via Scene Graph Discovery. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 4012–4021. DOI: 10.1145/3581783.3612407. Online publication date: 26-Oct-2023.


      Published In

      Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII
      Oct 2022
      805 pages
      ISBN: 978-3-031-19814-4
      DOI: 10.1007/978-3-031-19815-1

      Publisher

      Springer-Verlag, Berlin, Heidelberg

      Publication History

      Published: 23 October 2022

      Author Tags

      1. Open-vocabulary scene graph generation
      2. Visual-language model pretraining
      3. Prompt-based finetuning

      Qualifiers

      • Article
