
Towards Open-Vocabulary Scene Graph Generation with Prompt-Based Finetuning

Published: 23 October 2022

Abstract

Scene graph generation (SGG) is a fundamental task that aims to detect visual relations between objects in an image. Prevailing SGG methods require all object classes to be given in the training set; such a closed setting limits the practical application of SGG. In this paper, we introduce open-vocabulary scene graph generation (Ov-SGG), a novel, realistic, and challenging setting in which a model is trained on a set of base object classes but is required to infer relations for unseen target object classes. To this end, we propose a two-step method that first pre-trains on large amounts of coarse-grained region-caption data and then leverages two prompt-based techniques to finetune the pre-trained model without updating its parameters. Moreover, our method supports inference over completely unseen object classes, which existing methods are incapable of handling. In extensive experiments on three benchmark datasets (Visual Genome, GQA, and Open Images), our method significantly outperforms recent strong SGG methods in the Ov-SGG setting as well as in conventional closed SGG.
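
To make the prompt idea concrete, below is a minimal, text-only sketch of prompt-based predicate scoring with a frozen masked language model: candidate predicates for a (subject, object) pair are ranked by the model's score at a mask slot, with no parameter updates. This is an illustrative simplification, not the paper's method, which additionally conditions on visual region features; the model name, prompt template, and predicate list here are all assumptions.

```python
# A minimal, text-only sketch of prompt-based predicate scoring with a frozen
# masked language model. Illustrative only: the paper's approach also uses
# visual region features; the model, template, and predicates are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()  # kept frozen: no parameter updates, mirroring prompt-based finetuning

# Hypothetical candidate predicates; a real SGG vocabulary is much larger.
PREDICATES = ["on", "under", "near", "holding", "riding", "wearing"]

def rank_predicates(subject: str, obj: str):
    """Rank candidate predicates by the MLM logit at the [MASK] slot."""
    prompt = f"the {subject} is {tokenizer.mask_token} the {obj}."
    inputs = tokenizer(prompt, return_tensors="pt")
    # Locate the [MASK] position in the tokenized sequence.
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    scores = {}
    for pred in PREDICATES:
        ids = tokenizer.encode(pred, add_special_tokens=False)
        if len(ids) == 1:  # keep predicates that span a single wordpiece
            scores[pred] = logits[ids[0]].item()
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_predicates("man", "horse"))  # e.g. "riding" should rank highly
```

In the paper's setting, the prompts are instead paired with region features from the pre-trained visual-language model, which is what lets the same frozen model score relations for object classes never seen during relation training.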


Cited By

  • (2024) Knowledge Distillation for Single Image Super-Resolution via Contrastive Learning. In: Proceedings of the 2024 International Conference on Multimedia Retrieval, pp. 1079–1083. DOI: 10.1145/3652583.3657606. Online publication date: 30-May-2024.
  • (2023) Zero-shot visual relation detection via composite visual cues from large language models. In: Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 50105–50116. DOI: 10.5555/3666122.3668301. Online publication date: 10-Dec-2023.
  • (2023) Open visual knowledge extraction via relation-oriented multimodality model prompting. In: Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 23499–23519. DOI: 10.5555/3666122.3667141. Online publication date: 10-Dec-2023.
  • (2023) Open-Vocabulary Object Detection via Scene Graph Discovery. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 4012–4021. DOI: 10.1145/3581783.3612407. Online publication date: 26-Oct-2023.


      Published In

      Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII
      Oct 2022
      805 pages
      ISBN: 978-3-031-19814-4
      DOI: 10.1007/978-3-031-19815-1

      Publisher

      Springer-Verlag, Berlin, Heidelberg

      Publication History

      Published: 23 October 2022

      Author Tags

      1. Open-vocabulary scene graph generation
      2. Visual-language model pretraining
      3. Prompt-based finetuning

      Qualifiers

      • Article
