
Multi-feature Fusion VoteNet for 3D Object Detection

Authors: Jun Wang

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 18, Issue 1

Article No. 6, pages 1–17

https://doi.org/10.1145/3462219

Published: 27 January 2022

Abstract

In this article, we propose a Multi-feature Fusion VoteNet (MFFVoteNet) framework that improves 3D object detection in cluttered and heavily occluded scenes. The method takes a point cloud and a synchronized RGB image as input and outputs object detections in 3D space. The detection architecture builds on VoteNet with three key designs. First, we augment the VoteNet input with per-point color information, which makes different instances in a scene easier to tell apart. Second, we integrate an image feature module into VoteNet to provide a strong object-class signal that supports more reliable detection under occlusion. Third, we propose a Projection Non-Maximum Suppression (PNMS) method for 3D object detection that eliminates redundant proposals and hence localizes 3D objects more accurately. We evaluate MFFVoteNet on two challenging 3D object detection datasets, ScanNetv2 and SUN RGB-D. Extensive experiments show that our framework effectively improves 3D object detection performance.
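The abstract names three design choices but gives no implementation details. As a rough illustration only, the Python sketch below shows two generic building blocks under our own assumptions: appending per-point RGB (rescaled to [0, 1]) to XYZ coordinates, and a greedy non-maximum suppression whose overlap test is computed after projecting axis-aligned 3D boxes onto the x-y (floor) plane. The projection plane, the axis-aligned box format, the 0.25 threshold, and every function name (`append_color`, `bev_iou`, `projected_nms`) are hypothetical. This is not the paper's PNMS, whose exact formulation is not given here.

```python
from typing import List, Sequence, Tuple

# Hypothetical types: a point is (x, y, z); a color is (r, g, b) in 0..255.
Point = Tuple[float, float, float]
Color = Tuple[int, int, int]

# Hypothetical axis-aligned 3D box: (xmin, ymin, zmin, xmax, ymax, zmax).
Box3D = Tuple[float, float, float, float, float, float]


def append_color(points: Sequence[Point], colors: Sequence[Color]) -> List[Tuple[float, ...]]:
    """Concatenate per-point RGB, rescaled to [0, 1], onto XYZ to form 6-D input points."""
    if len(points) != len(colors):
        raise ValueError("points and colors must have the same length")
    return [tuple(p) + tuple(c / 255.0 for c in col) for p, col in zip(points, colors)]


def bev_iou(a: Box3D, b: Box3D) -> float:
    """IoU of two boxes after projecting both onto the x-y plane (height is ignored)."""
    ix = max(0.0, min(a[3], b[3]) - max(a[0], b[0]))
    iy = max(0.0, min(a[4], b[4]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[3] - a[0]) * (a[4] - a[1])
    area_b = (b[3] - b[0]) * (b[4] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def projected_nms(boxes: Sequence[Box3D], scores: Sequence[float],
                  iou_thresh: float = 0.25) -> List[int]:
    """Greedy NMS (assumed variant, not the paper's PNMS).

    Visits proposals in descending score order and keeps a box only if its
    projected IoU with every already-kept box is at most iou_thresh.
    Returns the kept indices, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(bev_iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep
```

For example, `projected_nms([(0, 0, 0, 1, 1, 1), (0.1, 0, 0, 1.1, 1, 1), (3, 3, 0, 4, 4, 1)], [0.9, 0.8, 0.7])` keeps indices 0 and 2: the second box overlaps the first with a projected IoU of about 0.82 and is suppressed. Because this overlap ignores height, two boxes stacked vertically over the same footprint also count as fully overlapping; whether that is acceptable depends on the scene.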



Index Terms

Computing methodologies
  Artificial intelligence
    Computer vision
      Computer vision problems
        Object detection



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 1
January 2022
517 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3505205

Editor: Alberto Del Bimbo, University of Firenze, Italy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 January 2022

Accepted: 01 April 2021

Revised: 01 March 2021

Received: 01 December 2020

Published in TOMM Volume 18, Issue 1


Qualifiers

Research-article
Refereed

Funding Sources

National Key Research and Development Program of China
Aeronautical Science Foundation of China
National Natural Science Foundation of China
Natural Science Foundation of Jiangsu Province

Bibliometrics

Total citations: 28
Total downloads: 1,064
Downloads (last 12 months): 277
Downloads (last 6 weeks): 16

Download figures reflect data up to 06 Nov 2024.

Cited By

Zhou Y, Xie Z, Zhao J, Du W, Yao R, El Saddik A. 2024. Multi-Modal LiDAR Point Cloud Semantic Segmentation with Salience Refinement and Boundary Perception. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 10, 1–20. Online publication date: 1 Jul 2024. https://doi.org/10.1145/3674979

Zhang D, Zhang M, Tan X, Liu J. 2024. Bridging the Domain Gap in Scene Flow Estimation via Hierarchical Smoothness Refinement. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 8, 1–21. Online publication date: 12 Jun 2024. https://doi.org/10.1145/3661823

Liu J, Zheng Z, Yang Z, Yang Y. 2024. High Fidelity Makeup via 2D and 3D Identity Preservation Net. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 8, 1–24. Online publication date: 13 Jun 2024. https://doi.org/10.1145/3656475

Ma Y, Zhao C, Huang B, Li X, Basu A. 2024. RAST: Restorable Arbitrary Style Transfer. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 5, 1–21. Online publication date: 22 Jan 2024. https://doi.org/10.1145/3638770

Qiu H, Li H, Wu Q, Shi H, Wang L, Meng F, Xu L. 2024. Learning Offset Probability Distribution for Accurate Object Detection. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 5, 1–24. Online publication date: 22 Jan 2024. https://doi.org/10.1145/3637214

Ma Z, Wang S, Luo X, Gu Z, Chen C, Li J, Hua X, Lu G. 2024. HARR: Learning Discriminative and High-Quality Hash Codes for Image Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 5, 1–23. Online publication date: 22 Jan 2024. https://doi.org/10.1145/3627162

Gu L, Yan X, Cui P, Gong L, Xie H, Wang F, Qin J, Wei M. 2024. PointSee: Image Enhances Point Cloud. IEEE Transactions on Visualization and Computer Graphics 30, 9, 6291–6308. Online publication date: Sep 2024. https://doi.org/10.1109/TVCG.2023.3331779

Zhu Z, Nan L, Xie H, Chen H, Wang J, Wei M, Qin J. 2024. CSDN: Cross-Modal Shape-Transfer Dual-Refinement Network for Point Cloud Completion. IEEE Transactions on Visualization and Computer Graphics 30, 7, 3545–3563. Online publication date: Jul 2024. https://doi.org/10.1109/TVCG.2023.3236061

Wang R, Hao Y, Hu L, Li X, Chen M, Miao Y, Humar I. 2024. Efficient Crowd Counting via Dual Knowledge Distillation. IEEE Transactions on Image Processing 33, 569–583. Online publication date: 1 Jan 2024. https://doi.org/10.1109/TIP.2023.3343609

Gomez D, Selvaraj M, Casas J, Mathiyazhagan K, Rodriguez M, Assefa T, Mlaki A, Nyakunga G, Kato F, Mukankusi C, Girma E, Mosquera G, Arredondo V, Espitia E. 2024. Advancing common bean (Phaseolus vulgaris L.) disease detection with YOLO driven deep learning to enhance agricultural AI. Scientific Reports 14, 1. Online publication date: 6 Jul 2024. https://doi.org/10.1038/s41598-024-66281-w
