DOI: 10.1145/3581783.3613798 · Research article · Open access

OccluBEV: Occlusion Aware Spatiotemporal Modeling for Multi-view 3D Object Detection

Published: 27 October 2023

Abstract

Bird's-Eye-View (BEV) based 3D visual perception, which formulates a unified space for multi-view representation, has received wide attention in autonomous driving due to its scalability to downstream tasks. However, the view transform in transformer-based BEV methods is agnostic of 3D occlusion relationships, which degrades the model. To construct a higher-quality BEV space, this paper analyzes the mutual occlusion problems in the view transform process and proposes a new transformer-based method named OccluBEV. OccluBEV alleviates the occlusion issue via point cloud information distillation in both the image and BEV spaces. Specifically, in the image space, we perform depth estimation for each pixel and use it to guide image feature mapping. Further, since predicting depth directly from a monocular image is ill-posed and ignores stereo information such as multi-view and temporal cues, this paper introduces a voxel visibility segmentation task in the 3D BEV space, which explicitly predicts whether each voxel in the 3D BEV grid is occupied. In addition, to alleviate overfitting in BEV feature learning under a single task, we design a multi-head learning framework that jointly models multiple strongly correlated tasks in a unified BEV space. The effectiveness of the proposed method is fully validated on the nuScenes dataset, achieving a competitive NDS/mAP of 57.5/47.9 on the nuScenes test leaderboard with a ResNet-101 backbone, which is superior to state-of-the-art camera-based solutions.
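To make the voxel visibility segmentation target concrete: its supervision labels can be derived by voxelizing a LiDAR sweep into a binary occupancy grid. The sketch below shows one minimal way to do that; the function name, grid parameters, and use of NumPy are our own illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def voxelize_occupancy(points, pc_range, voxel_size):
    """Label each voxel in a 3D BEV grid as occupied (1) or free (0).

    points:     (N, 3+) array of LiDAR points (x, y, z, ...).
    pc_range:   [x_min, y_min, z_min, x_max, y_max, z_max] perception range.
    voxel_size: [dx, dy, dz] size of one voxel in meters.
    """
    pc_range = np.asarray(pc_range, dtype=np.float32)
    voxel_size = np.asarray(voxel_size, dtype=np.float32)
    grid_shape = np.round((pc_range[3:] - pc_range[:3]) / voxel_size).astype(int)

    # Keep only points inside the perception range.
    mask = np.all((points[:, :3] >= pc_range[:3]) &
                  (points[:, :3] < pc_range[3:]), axis=1)
    pts = points[mask, :3]

    # Map each surviving point to its voxel index and mark that voxel occupied.
    idx = ((pts - pc_range[:3]) / voxel_size).astype(int)
    occupancy = np.zeros(grid_shape, dtype=np.uint8)
    occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return occupancy
```

A grid labeled this way gives a per-voxel binary classification target, so the visibility head can be trained with an ordinary cross-entropy loss alongside the detection head.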


Cited By

  • (2024) CalibRBEV: Multi-Camera Calibration via Reversed Bird's-eye-view Representations for Autonomous Driving. In Proceedings of the 32nd ACM International Conference on Multimedia, 9145-9154. DOI: 10.1145/3664647.3680572. Online publication date: 28 October 2024.

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. 3d perception
    2. bird's eye view
    3. multi-view
    4. object detection


    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
