DOI: 10.1145/3590003.3590039

A Component for Query-based Object Detection in Crowded Scenes

Published: 29 May 2023

Abstract

Query-based object detection methods, such as DETR and Sparse R-CNN, have gained considerable attention in recent years. However, in crowded scenes, these end-to-end detectors are prone to false positives. To address this issue, we propose a graph-convolution-based post-processing component that refines the output of Sparse R-CNN. Specifically, we first select high-scoring queries to generate true-positive (accepted) predictions. A query updater then refines the remaining noisy query features with a graph convolutional network (GCN). Finally, the label assignment rule matches accepted predictions to ground-truth objects, removes the matched targets, and associates the noisy predictions with the remaining ground-truth objects. Our method significantly improves performance in crowded scenes, achieving 92.3% AP and 41.6% on the challenging CrowdHuman object detection benchmark.
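
The pipeline described above (score-based query selection, a GCN query updater, and re-assignment of the noisy predictions) can be illustrated with a minimal sketch. This is not the authors' implementation: the names QueryUpdater and refine_predictions, the IoU-based adjacency, the 0.3 overlap threshold, and the 0.7 score threshold are illustrative assumptions about how such a component could be wired onto Sparse R-CNN-style query features and boxes.

# Minimal sketch (not the authors' code) of a GCN-based post-processing
# component for a query-based detector such as Sparse R-CNN.
# Assumptions: query features are (N, C) tensors, boxes are (N, 4) in xyxy
# format, scores are per-query confidences; thresholds are illustrative.
import torch
import torch.nn as nn


def box_iou(boxes1: torch.Tensor, boxes2: torch.Tensor) -> torch.Tensor:
    """Pairwise IoU between two sets of xyxy boxes."""
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])  # (N, M, 2)
    rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])  # (N, M, 2)
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area1[:, None] + area2[None, :] - inter + 1e-6)


class QueryUpdater(nn.Module):
    """One graph-convolution layer that refines noisy query features by
    aggregating information from spatially overlapping queries."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # Adjacency from spatial overlap: queries whose boxes overlap exchange
        # information; the diagonal (IoU = 1) keeps each query's own feature.
        adj = (box_iou(boxes, boxes) > 0.3).float()
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1)  # row-normalize
        return self.norm(feats + adj @ self.proj(feats))        # GCN step


def refine_predictions(feats, boxes, scores, score_thr=0.7):
    """Split queries into accepted (high-score) and noisy ones, then refine
    the noisy query features with the GCN updater."""
    accepted = scores > score_thr
    refined = QueryUpdater(feats.shape[-1])(feats, boxes)
    # Accepted queries keep their original features; noisy ones are replaced
    # by the graph-refined features before re-scoring / label assignment.
    return torch.where(accepted[:, None], feats, refined)


if __name__ == "__main__":
    N, C = 100, 256
    feats = torch.randn(N, C)
    boxes = torch.rand(N, 4) * 100
    boxes[:, 2:] += boxes[:, :2]  # make xyxy boxes valid
    scores = torch.rand(N)
    print(refine_predictions(feats, boxes, scores).shape)  # torch.Size([100, 256])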




          Published In

          CACML '23: Proceedings of the 2023 2nd Asia Conference on Algorithms, Computing and Machine Learning
          March 2023
          598 pages
          ISBN:9781450399449
          DOI:10.1145/3590003

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Author Tags

          1. End-to-end object detection
          2. Graph convolution
          3. Query

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

          CACML 2023

          Acceptance Rates

          CACML '23 Paper Acceptance Rate: 93 of 241 submissions, 39%
          Overall Acceptance Rate: 93 of 241 submissions, 39%
