
Supervoxel Convolution for Online 3D Semantic Segmentation

Published: 01 August 2021

Abstract

Online 3D semantic segmentation, which aims to perform real-time 3D scene reconstruction along with semantic segmentation, is an important but challenging topic. A key challenge is to strike a balance between efficiency and segmentation accuracy. There are very few deep-learning-based solutions to this problem, since the commonly used deep representations based on volumetric grids or points do not provide an efficient 3D representation and organization structure for online segmentation. Observing that on-surface supervoxels, i.e., clusters of on-surface voxels, provide a compact representation of 3D surfaces and bring an efficient connectivity structure via supervoxel clustering, we explore a supervoxel-based deep learning solution for this task. To this end, we contribute a novel convolution operation (SVConv) that operates directly on supervoxels. SVConv can efficiently fuse the multi-view 2D features and 3D features projected on supervoxels during online 3D reconstruction, and leads to an effective supervoxel-based convolutional neural network, termed Supervoxel-CNN, which enables 2D-3D joint learning for 3D semantic prediction. With Supervoxel-CNN, we propose a clustering-then-prediction approach to online 3D semantic segmentation. Extensive evaluations on public 3D indoor scene datasets show that our approach significantly outperforms existing online semantic segmentation systems in terms of efficiency or accuracy.
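
The clustering-then-prediction idea can be illustrated with a minimal sketch. This is not the SVConv operator itself, whose definition the abstract does not give. It assumes a generic neighbor-averaging layer on a supervoxel adjacency graph, and every name in it (`pool_to_supervoxels`, `neighbor_mean_conv`, `w_self`, `w_nbr`) is hypothetical. Per-voxel features are first averaged within each supervoxel, then each supervoxel combines its own feature with the mean feature of its adjacent supervoxels.

```python
import numpy as np


def pool_to_supervoxels(voxel_feats, labels, num_sv):
    """Average per-voxel features within each supervoxel (the clustering step)."""
    sums = np.zeros((num_sv, voxel_feats.shape[1]))
    np.add.at(sums, labels, voxel_feats)  # scatter-add voxel features by label
    counts = np.bincount(labels, minlength=num_sv).astype(float)
    return sums / np.maximum(counts, 1.0)[:, None]


def neighbor_mean_conv(sv_feats, edges, w_self, w_nbr):
    """Illustrative layer (an assumption, not SVConv):
    ReLU(self @ w_self + mean(neighbors) @ w_nbr)."""
    n = sv_feats.shape[0]
    agg = np.zeros_like(sv_feats)
    deg = np.zeros(n)
    for i, j in edges:  # each pair is an undirected supervoxel adjacency
        agg[i] += sv_feats[j]
        agg[j] += sv_feats[i]
        deg[i] += 1
        deg[j] += 1
    agg /= np.maximum(deg, 1.0)[:, None]  # isolated supervoxels keep a zero aggregate
    return np.maximum(sv_feats @ w_self + agg @ w_nbr, 0.0)


# Toy input: three voxels in two supervoxels that are adjacent to each other.
voxel_feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
sv_feats = pool_to_supervoxels(voxel_feats, labels, num_sv=2)
out = neighbor_mean_conv(sv_feats, [(0, 1)], np.eye(2), np.eye(2))
```

Working on supervoxels rather than individual voxels keeps the graph small, which is the efficiency argument the abstract makes. The fusion of multi-view 2D features with 3D features is left out of this sketch.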

Supplementary Material

huang (huang.zip)
Appendix, image, and software files for Supervoxel Convolution for Online 3D Semantic Segmentation
40-3-3453485-Article34 (40-3-3453485-article34.mp4)
Presentation at SIGGRAPH Asia '21



    Published In

    ACM Transactions on Graphics  Volume 40, Issue 3
    June 2021
    264 pages
    ISSN:0730-0301
    EISSN:1557-7368
    DOI:10.1145/3463476
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 August 2021
    Accepted: 01 March 2021
    Revised: 01 February 2021
    Received: 01 October 2020
    Published in TOG Volume 40, Issue 3


    Author Tags

    1. Depth fusion
    2. semantic mapping
    3. supervoxel clustering
    4. supervoxel convolution
    5. deep learning

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • Natural Science Foundation of China
    • Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology
    • Research Grants Council of the Hong Kong Special Administrative Region, China
    • City University of Hong Kong
    • China Postdoctoral Science Foundation
