Multi-Channel Convolutional Neural Network Based 3D Object Detection for Indoor Robot Environmental Perception
Abstract
1. Introduction
2. Related Work
2.1. Voxel and Point Cloud-Based Approach
2.2. 3D Multi-View Approach
2.3. 2.5D Image Approach
3. Multi-Channel 3D Object Detection CNN
3.1. Input Data Generation for Convolutional Neural Network
3.2. 2D and 3D Proposals Generation Based on Semantic Prior
3.2.1. 3D Proposal Parameter Representation
3.2.2. Proposals Generation
3.2.3. 3D Bounding Box Regression of Objects
3.2.4. Multi-Task Loss Function
4. Experiments
4.1. Training Data Generation
4.2. Training Parameter Setting
4.3. Experimental Results and Analysis
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Krizhevsky, A.; Sutskever, I.; Hinton, G. Imagenet classification with deep convolutional neural networks. In Proceedings of the Conference on Neural Information Processing Systems (NIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1–9. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef] [Green Version]
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D object detection from RGB-D Data. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 918–927. [Google Scholar]
- Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
- Xiang, Y.; Choi, W.; Lin, Y.; Savarese, S. Data-driven 3D voxel patterns for object category recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; pp. 1903–1911. [Google Scholar]
- Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3D object detection for autonomous driving. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 2147–2156. [Google Scholar]
- Chen, X.; Zhu, Y. 3D object proposals for accurate object class detection. In Proceedings of the Conference on Neural Information Processing Systems (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; pp. 1–9. [Google Scholar]
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 6526–6534. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGBD images. In Proceedings of the 12th European Conference on Computer Vision (ECCV 2012), Florence, Italy, 7–13 October 2012; pp. 1–14. [Google Scholar]
- Song, S.; Lichtenberg, S.P.; Xiao, J. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; pp. 567–576. [Google Scholar]
- Zhuang, Y.; Lin, X.Q.; Hu, H.S.; Guo, G. Using scale coordination and semantic information for robust 3-D object recognition by a service robot. IEEE Sens. J. 2015, 15, 37–47. [Google Scholar] [CrossRef]
- Lin, C.-M.; Tsai, C.-Y.; Lai, Y.-C.; Li, S.-A.; Wong, C.-C. Visual object recognition and pose estimation based on a deep semantic segmentation network. IEEE Sens. J. 2018, 18, 9370–9381. [Google Scholar] [CrossRef]
- Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 1–9. [Google Scholar]
- Engelcke, M.; Rao, D.; Wang, D.; Tong, C.; Posner, I. Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA 2017), Singapore, 29 May–3 June 2017; pp. 1355–1361. [Google Scholar]
- Song, S.; Xiao, J. Deep sliding shapes for amodal 3D object detection in RGB-D images. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 808–816. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 1–10. [Google Scholar]
- Qi, C.R.; Su, H.; Nießner, M.; Dai, A.; Yan, M.; Guibas, L.J. Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 5648–5656. [Google Scholar]
- Enzweiler, M.; Gavrila, D.M. A multi-level mixture-of-experts framework for pedestrian classification. IEEE Trans. Image Process. 2011, 20, 2967–2979. [Google Scholar] [CrossRef] [PubMed]
- Gonzalez, A.; Vazquez, D.; Lopez, A.M.; Amores, J. On-board object detection: Multicue, multimodal, and multiview random forest of local experts. IEEE Trans. Cybern. 2017, 47, 3980–3990. [Google Scholar] [CrossRef] [PubMed]
- Lahoud, J.; Ghanem, B. 2D-driven 3D object detection in RGB-D images. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 4632–4640. [Google Scholar]
- Deng, Z.; Latecki, L.J. Amodal detection of 3D objects: Inferring 3D bounding boxes from 2D ones in RGB-Depth images. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 398–406. [Google Scholar]
- Arbelaez, P.; Pont-Tuset, J.; Barron, J.; Marques, F.; Malik, J. Multiscale combinatorial grouping. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 23–28 June 2014; pp. 328–335. [Google Scholar]
- Ren, Z.; Sudderth, E.B. Three-dimensional object detection and layout prediction using clouds of oriented gradients. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 1525–1533. [Google Scholar]
- Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Rob. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
- Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Rob. 2017, 33, 1–8. [Google Scholar] [CrossRef]
Methods | Per-class AP (17 classes; class labels not preserved in extraction) | mAP
---|---|---
[27] | 62.3, 81.2, 23.9, 3.8, 58.2, 24.5, 36.1, 0.0, 31.6, 28.7, 54.5, 38.5, 40.5, 55.2, 43.7, 1.0, 76.3 | 38.8
[34] | 36.1, 84.5, 40.6, 4.9, 46.4, 44.8, 33.1, 10.2, 44.9, 29.4, 60.6, 46.3, 58.3, 61.8, 43.2, 16.3, 79.7 | 43.6
Ours | 36.6, 88.4, 41.5, 6.4, 55.3, 46.7, 37.2, 8.2, 43.4, 35.9, 62.1, 49.7, 61.4, 65.3, 47.8, 20.4, 83.8 | 46.5
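The mAP column is consistent with the unweighted mean of the per-class AP values, which can be verified with a minimal Python check (values taken from the "Ours" row of the comparison table above):

```python
# The reported mAP is the unweighted mean of the 17 per-class AP values.
# Values below are the "Ours" row from the comparison table above.
per_class_ap = [36.6, 88.4, 41.5, 6.4, 55.3, 46.7, 37.2, 8.2, 43.4,
                35.9, 62.1, 49.7, 61.4, 65.3, 47.8, 20.4, 83.8]

mAP = sum(per_class_ap) / len(per_class_ap)
print(round(mAP, 1))  # 46.5, matching the mAP column
```

The same check reproduces the mAP entries of the other rows and tables to one decimal place.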
Methods | Per-class AP (17 classes; class labels not preserved in extraction) | mAP
---|---|---
Img+Depth | 35.8, 84.2, 40.5, 4.7, 46.9, 43.4, 33.0, 10.5, 44.6, 28.8, 61.1, 45.9, 58.5, 61.7, 43.6, 15.5, 78.3 | 43.4
Img+Depth+BEV | 36.6, 88.4, 41.5, 6.4, 55.3, 46.7, 37.2, 8.2, 43.4, 35.9, 62.1, 49.7, 61.4, 65.3, 47.8, 20.4, 83.8 | 46.5
Methods | Per-class AP (10 classes; class labels not preserved in extraction) | mAP
---|---|---
[27] | 44.2, 78.8, 11.9, 61.2, 20.5, 6.4, 15.4, 53.5, 50.3, 78.9 | 42.1
[36] | 58.3, 63.7, 31.8, 62.2, 45.2, 15.5, 27.4, 51.0, 51.3, 70.1 | 47.6
Ours | 55.4, 73.6, 26.2, 64.3, 43.1, 17.3, 24.5, 54.9, 53.7, 75.4 | 48.8
Figure 5 | Obj | x_gt | y_gt | z_gt | l_gt | w_gt | h_gt | θ_gt (°) | x | y | z | l | w | h | θ (°) | 3D IoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
(b) | (1) | 1.20 | −0.85 | 2.36 | 1.44 | 0.72 | 0.70 | 0 | 1.24 | −0.87 | 2.28 | 1.51 | 0.80 | 0.75 | 7.6 | 0.70
(b) | (2) | 1.50 | 0.20 | 4.25 | 0.95 | 0.15 | 0.55 | 0 | 1.41 | 0.18 | 4.17 | 0.88 | 0.22 | 0.61 | 5.9 | 0.33
(b) | (3) | 4.20 | −0.85 | 2.10 | 0.85 | 0.68 | 0.70 | −90 | 4.17 | −0.91 | 2.17 | 0.91 | 0.77 | 0.76 | −75.4 | 0.63
(b) | (4) | 4.20 | −0.85 | 0.20 | 0.85 | 0.68 | 0.70 | −90 | 4.28 | −0.80 | 0.27 | 0.71 | 0.62 | 0.57 | −83.4 | 0.54
(c) | (1) | −1.02 | −0.20 | 2.00 | 2.00 | 0.41 | 2.10 | 90 | −0.88 | −0.24 | 1.91 | 2.11 | 0.48 | 2.02 | 83.1 | 0.46
(c) | (2) | 0.75 | −0.80 | 2.30 | 1.35 | 0.75 | 0.80 | 0 | 0.71 | −0.82 | 2.27 | 1.40 | 0.80 | 0.77 | 1.8 | 0.83
(c) | (3) | 3.00 | −0.80 | 2.50 | 1.92 | 0.75 | 0.80 | 0 | 2.94 | −0.83 | 2.52 | 1.95 | 0.80 | 0.87 | 2.7 | 0.80

Average 3D IoU: 0.61
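For near-axis-aligned rows (θ ≈ 0), the 3D IoU values in the table can be approximated with a simple axis-aligned intersection-over-union. The sketch below ignores the yaw angle θ and assumes a hypothetical axis mapping (l along x, h along y, w along z) not stated in the source, so it is only a consistency check, not the paper's evaluation code:

```python
# Axis-aligned 3D IoU sketch. Ignores the yaw angle θ, so it only
# approximates the table's values for rows where θ is close to zero.
# Axis mapping (l -> x, h -> y, w -> z) is an assumption.

def box_corners(x, y, z, l, w, h):
    """Return (min, max) corners of an axis-aligned box centered at (x, y, z)."""
    return (x - l / 2, y - h / 2, z - w / 2), (x + l / 2, y + h / 2, z + w / 2)

def iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU of two (x, y, z, l, w, h) boxes."""
    (a_min, a_max), (b_min, b_max) = box_corners(*box_a), box_corners(*box_b)
    # Per-axis overlap lengths; zero if the boxes are disjoint along an axis.
    inter = 1.0
    for lo_a, lo_b, hi_a, hi_b in zip(a_min, b_min, a_max, b_max):
        inter *= max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    vol_a = box_a[3] * box_a[4] * box_a[5]
    vol_b = box_b[3] * box_b[4] * box_b[5]
    return inter / (vol_a + vol_b - inter)

# Row (c)(2) from the table (θ_gt = 0, θ_est = 1.8°, nearly axis-aligned):
gt  = (0.75, -0.80, 2.30, 1.35, 0.75, 0.80)
est = (0.71, -0.82, 2.27, 1.40, 0.80, 0.77)
print(round(iou_3d(gt, est), 2))  # ≈ 0.83, matching the table's 3D IoU
```

For the rotated rows (θ near ±90°, or a large gap between θ_gt and θ), an exact value requires intersecting the yaw-rotated footprints in the ground plane before multiplying by the vertical overlap.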
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, L.; Li, R.; Shi, H.; Sun, J.; Zhao, L.; Seah, H.S.; Quah, C.K.; Tandianus, B. Multi-Channel Convolutional Neural Network Based 3D Object Detection for Indoor Robot Environmental Perception. Sensors 2019, 19, 893. https://doi.org/10.3390/s19040893