DOI: 10.1145/3581783.3613798 · Research article · Open access

OccluBEV: Occlusion Aware Spatiotemporal Modeling for Multi-view 3D Object Detection

Published: 27 October 2023

Abstract

Bird's-Eye-View (BEV) based 3D visual perception, which formulates a unified space for multi-view representation, has received wide attention in autonomous driving due to its scalability to downstream tasks. However, the view transform in transformer-based BEV methods is agnostic of 3D occlusion relationships, which degrades the model. To construct a higher-quality BEV space, this paper analyzes the mutual occlusion problems in the view transform process and proposes a new transformer-based method named OccluBEV. OccluBEV alleviates the occlusion issue via point cloud information distillation in both the image and BEV spaces. Specifically, in the image space, we perform depth estimation for each pixel and use it to guide image feature mapping. Further, since predicting depth directly from a monocular image is ill-posed and ignores stereo information such as multi-view and temporal cues, this paper introduces a voxel visibility segmentation task in the 3D BEV space, which explicitly predicts whether each voxel in the 3D BEV grid is occupied. In addition, to alleviate overfitting in BEV feature learning under a single task, we design a multi-head learning framework that jointly models multiple strongly correlated tasks in a unified BEV space. The effectiveness of the proposed method is fully validated on the nuScenes dataset, achieving a competitive NDS/mAP of 57.5/47.9 on the nuScenes test leaderboard with a ResNet-101 backbone, which is superior to state-of-the-art camera-based solutions.
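To make the voxel visibility segmentation target concrete: its supervision labels can be derived by voxelizing a LiDAR sweep into a binary occupancy grid. The sketch below shows one minimal way to do that; the function name, grid parameters, and use of NumPy are our own illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def voxelize_occupancy(points, pc_range, voxel_size):
    """Label each voxel in a 3D BEV grid as occupied (1) or free (0).

    points:     (N, 3+) array of LiDAR points (x, y, z, ...).
    pc_range:   [x_min, y_min, z_min, x_max, y_max, z_max] perception range.
    voxel_size: [dx, dy, dz] size of one voxel in meters.
    """
    pc_range = np.asarray(pc_range, dtype=np.float32)
    voxel_size = np.asarray(voxel_size, dtype=np.float32)
    grid_shape = np.round((pc_range[3:] - pc_range[:3]) / voxel_size).astype(int)

    # Keep only points inside the perception range.
    mask = np.all((points[:, :3] >= pc_range[:3]) &
                  (points[:, :3] < pc_range[3:]), axis=1)
    pts = points[mask, :3]

    # Map each surviving point to its voxel index and mark that voxel occupied.
    idx = ((pts - pc_range[:3]) / voxel_size).astype(int)
    occupancy = np.zeros(grid_shape, dtype=np.uint8)
    occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return occupancy
```

A grid labeled this way gives a per-voxel binary classification target, so the visibility head can be trained with an ordinary cross-entropy loss alongside the detection head.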


Cited By

  • (2024) CalibRBEV: Multi-Camera Calibration via Reversed Bird's-eye-view Representations for Autonomous Driving. In Proceedings of the 32nd ACM International Conference on Multimedia, 9145-9154. DOI: 10.1145/3664647.3680572. Online publication date: 28 October 2024.

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. 3d perception
    2. bird's eye view
    3. multi-view
    4. object detection


    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
