research-article

OctFormer: Octree-based Transformers for 3D Point Clouds

Published: 26 July 2023

Abstract

We propose octree-based transformers, named OctFormer, for 3D point cloud learning. OctFormer not only serves as a general and effective backbone for 3D point cloud segmentation and object detection but also has linear complexity and scales to large point clouds. The key challenge in applying transformers to point clouds is reducing the quadratic, and thus overwhelming, computational complexity of attention. To address this issue, several works divide point clouds into non-overlapping windows and constrain attention to each local window. However, the number of points in each window varies greatly, which impedes efficient execution on GPUs. Observing that attention is robust to the shapes of local windows, we propose a novel octree attention, which leverages the sorted shuffled keys of octrees to partition point clouds into local windows containing a fixed number of points while permitting the shapes of windows to change freely. We also introduce dilated octree attention to further expand the receptive field. Our octree attention can be implemented in 10 lines of code with open-source libraries and runs 17 times faster than other point cloud attentions when the number of points exceeds 200k. Built upon the octree attention, OctFormer can be easily scaled up and achieves state-of-the-art performance on a series of 3D semantic segmentation and 3D object detection benchmarks, surpassing previous sparse-voxel-based CNNs and point cloud transformers in both efficiency and effectiveness. Notably, on the challenging ScanNet200 dataset, OctFormer outperforms sparse-voxel-based CNNs by 7.3 in mIoU. Our code and trained models are available at https://wang-ps.github.io/octformer.
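The window partition described in the abstract can be illustrated with a minimal plain-Python sketch. It is not the authors' implementation, which relies on open-source GPU libraries. The function names `shuffled_key` and `octree_windows`, the x/y/z bit order of the interleaving, the `-1` padding value, and the strided layout used for dilation are all assumptions made for illustration. Points are given as integer voxel coordinates at a fixed octree depth.

```python
def shuffled_key(x, y, z, depth):
    """Interleave the bits of integer coordinates (a z-order / Morton code).

    Assumption: x occupies the highest bit of each 3-bit group, then y, then z.
    """
    key = 0
    for i in range(depth):
        key |= ((x >> i) & 1) << (3 * i + 2)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i)
    return key


def octree_windows(points, depth, window_size, dilation=1):
    """Group point indices into windows of exactly `window_size` entries.

    Indices are sorted by shuffled key, then padded with -1 (an assumed
    placeholder for "no point") up to a multiple of window_size * dilation.
    With dilation > 1, each window takes every `dilation`-th index from a
    block of window_size * dilation consecutive sorted indices.
    """
    order = sorted(range(len(points)),
                   key=lambda i: shuffled_key(*points[i], depth))
    block = window_size * dilation
    padded = order + [-1] * (-len(order) % block)
    windows = []
    for start in range(0, len(padded), block):
        chunk = padded[start:start + block]
        for d in range(dilation):
            windows.append(chunk[d::dilation])
    return windows


# Ten points on a 4 x 4 x 4 grid (octree depth 2).
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1),
          (3, 3, 3), (2, 2, 2), (1, 1, 1), (2, 0, 0), (0, 2, 0)]
local_windows = octree_windows(points, depth=2, window_size=4)
dilated_windows = octree_windows(points, depth=2, window_size=2, dilation=2)
```

Every window holds the same number of entries regardless of how the points are distributed, so attention can be computed on one dense batch of windows. With N points and a fixed window size K, there are about N/K windows of K × K pairwise interactions each, so the cost grows as N · K, which is linear in N.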

Supplementary Material

ZIP File (repository.zip)
Octree-based Transformers for 3D Point Clouds. The repository is also available from GitHub, at https://github.com/octree-nn/octformer
MP4 File (papers_335_VOD.mp4)
presentation

Published In

ACM Transactions on Graphics, Volume 42, Issue 4
August 2023
1912 pages
ISSN:0730-0301
EISSN:1557-7368
DOI:10.1145/3609020
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. point clouds
  2. transformers
  3. octree
  4. 3D semantic segmentation
  5. 3D object detection

Qualifiers

  • Research-article
