research-article

Open-Vocabulary Object Detection via Scene Graph Discovery

Authors:

Jianfei Cai

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 4012 - 4021

https://doi.org/10.1145/3581783.3612407

Published: 27 October 2023

Abstract

In recent years, open-vocabulary (OV) object detection has attracted increasing research attention. Unlike traditional detection, which only recognizes objects from a fixed set of categories, OV detection aims to detect objects from an open category set. Previous works often leverage vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects. However, they only use pairs of nouns and individual objects in VL data, although such data usually contain much richer information, such as scene graphs, that is also crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. First, we present a scene-graph-based decoder (SGDecoder) that includes sparse scene-graph-guided attention (SSGA). It captures scene graphs and leverages them to discover OV objects. Second, we propose scene-graph-based prediction (SGPred), in which we build a scene-graph-based offset regression (SGOR) mechanism to enable mutual enhancement between scene graph extraction and object localization. Third, we design a cross-modal learning mechanism in SGPred. It takes scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of our approach. Moreover, we show that our model can perform OV scene graph detection, a task that previous OV scene graph generation methods cannot tackle.
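
The abstract mentions classifying OV objects through consistent cross-modal embeddings. As a rough illustration of that general idea only, and not of SGDN itself, the sketch below labels a region by picking the category name whose text embedding has the highest cosine similarity to the region's visual embedding. The function names (`cosine`, `classify_region`) and the three-dimensional toy embeddings are hypothetical; a real system would obtain embeddings from trained vision and language encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_region(region_embedding, text_embeddings):
    """Return the best-matching category name and all similarity scores.

    text_embeddings maps a category name to its (hypothetical) text embedding.
    Because categories are given as embeddings rather than fixed class indices,
    new names can be added at inference time without retraining a classifier.
    """
    scores = {
        name: cosine(region_embedding, embedding)
        for name, embedding in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-made embeddings (assumption: purely illustrative values).
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],
}
region = [0.1, 0.2, 0.9]

label, scores = classify_region(region, text_embeddings)
print(label)
```

Extending the vocabulary in this sketch only requires adding another entry to `text_embeddings`, which is the property that distinguishes this style of classifier from a fixed-category one.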

Cited By

Lin C, Jiang Y, Qu L, Yuan Z, Cai J. (2024). Generative Region-Language Pretraining for Open-Ended Object Detection. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13958-13968. Online publication date: 16-Jun-2024.
https://doi.org/10.1109/CVPR52733.2024.01324
Shi H, Dao S, Cai J. (2024). LLMFormer: Large Language Model for Open-Vocabulary Semantic Segmentation. International Journal of Computer Vision. Online publication date: 16-Aug-2024.
https://doi.org/10.1007/s11263-024-02171-y

Index Terms

Open-Vocabulary Object Detection via Scene Graph Discovery
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Scene understanding

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik, University of Ottawa, Canada & MBZUAI, UAE
Tao Mei, HiDream.ai, China
Rita Cucchiara, University of Modena and Reggio Emilia, Italy

Program Chairs:
Marco Bertini, University of Florence, Italy
Diana Patricia Tobon Vallejo, Universidad de Medellin, Colombia
Pradeep K. Atrey, University at Albany, State University of New York, USA
M. Shamim Hossain, King Saud University, KSA

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Monash University
Australian Research Council

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%
