DOI: 10.1145/3664647.3681699
Research Article

Context-Aware Indoor Point Cloud Object Generation through User Instructions

Published: 28 October 2024

Abstract

Indoor scene modification has emerged as a prominent area within computer vision, particularly for its applications in Augmented Reality (AR) and Virtual Reality (VR). Traditional methods often rely on pre-existing object databases and predetermined object positions, limiting their flexibility and adaptability to new scenarios. In response to this challenge, we present a novel end-to-end multi-modal deep neural network that generates point cloud objects seamlessly integrated with their surroundings, driven by textual instructions. Our approach enables the creation of new environments with previously unseen object layouts, eliminating the need for pre-stored CAD models. Leveraging Point-E as our generative model, we introduce quantized position prediction and Top-K estimation to address false negatives arising from ambiguous language descriptions. Furthermore, we conduct comprehensive evaluations covering the diversity of generated objects, the efficacy of textual instructions, and quantitative metrics, affirming the realism and versatility of our model in generating indoor objects. To provide a holistic assessment, we incorporate visual grounding as an additional metric, ensuring the quality and coherence of the scenes produced by our model. Through these advancements, our approach not only advances the state of the art in indoor scene modification but also lays the foundation for future innovations in immersive computing and digital environment creation. The project is available at https://ainnovatelab.github.io/Context-aware-Indoor-PCG.
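The abstract names quantized position prediction and Top-K estimation but gives no implementation details. The following is a minimal illustrative sketch, not the authors' code: it assumes per-axis discretization of room coordinates into bins, a fused text-plus-scene feature vector as input, and a loss that skips examples whose ground-truth bin already falls inside the model's top-k guesses (one plausible reading of mitigating false negatives from ambiguous instructions). All module names, bin counts, and the room extent are hypothetical.

```python
# Illustrative sketch only; design choices here are assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedPositionHead(nn.Module):
    """Predicts an object's (x, y, z) position as three categorical
    distributions over discretized room coordinates (one classifier per axis)."""

    def __init__(self, feat_dim=512, num_bins=64, room_extent=8.0):
        super().__init__()
        self.num_bins = num_bins
        self.room_extent = room_extent              # assumed room size in meters
        self.bin_size = room_extent / num_bins
        self.heads = nn.ModuleList([nn.Linear(feat_dim, num_bins) for _ in range(3)])

    def forward(self, fused_feat):
        # fused_feat: (B, feat_dim) joint embedding of instruction and scene
        return torch.stack([h(fused_feat) for h in self.heads], dim=1)  # (B, 3, num_bins)

    def top_k_positions(self, logits, k=5):
        """Return the k most probable bin centers per axis, so several plausible
        placements of an ambiguous instruction are all retained."""
        probs = F.softmax(logits, dim=-1)                   # (B, 3, num_bins)
        top_p, top_idx = probs.topk(k, dim=-1)              # (B, 3, k)
        centers = (top_idx.float() + 0.5) * self.bin_size   # bin index -> coordinate
        return centers, top_p

def top_k_position_loss(logits, target_xyz, bin_size, k=5):
    """Cross-entropy against the ground-truth bin, zero-weighted whenever the true
    bin is already within the top-k predictions (softens ambiguous-language cases)."""
    target_bins = (target_xyz / bin_size).long().clamp(0, logits.size(-1) - 1)   # (B, 3)
    ce = F.cross_entropy(logits.flatten(0, 1), target_bins.flatten(), reduction="none")
    in_top_k = (logits.topk(k, dim=-1).indices == target_bins.unsqueeze(-1)).any(-1)
    return (ce * (~in_top_k.flatten()).float()).mean()

# Usage with random features standing in for a real text+scene encoder.
head = QuantizedPositionHead()
feats = torch.randn(4, 512)
logits = head(feats)
loss = top_k_position_loss(logits, target_xyz=torch.rand(4, 3) * 8.0, bin_size=head.bin_size)
positions, scores = head.top_k_positions(logits, k=5)
```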

References

[1]
Achlioptas, P., Abdelreheem, A., Xia, F., Elhoseiny, M., and Guibas, L. J. ReferIt3D: Neural listeners for fine-grained 3d object identification in real-world scenes. In ECCV (2020).
[2]
Achlioptas, P., Diamanti, O., Mitliagkas, I., and Guibas, L. Learning representations and generative models for 3D point clouds. In Int. Conf. Mach. Learn. (10--15 Jul 2018), J. Dy and A. Krause, Eds., vol. 80 of Proceedings of Machine Learning Research, pp. 40--49.
[3]
Bautista, M. A., Guo, P., Abnar, S., Talbott,W., Toshev, A., Chen, Z., Dinh, L., Zhai, S., Goh, H., Ulbricht, D., et al. Gaudi: A neural architect for immersive 3d scene generation. NeurIPS 35 (2022), 25102--25116.
[4]
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In NeurIPS (2020), vol. 33, pp. 1877--1901.
[5]
Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015).
[6]
Chen, D. Z., Chang, A. X., and Niessner, M. Scanrefer: 3d object localization in rgb-d scans using natural language. ECCV (2020).
[7]
Chen, K., Choy, C. B., Savva, M., Chang, A. X., Funkhouser, T., and Savarese, S. Text2shape: Generating shapes from natural language by learning joint embeddings. In Asian Conf. Comput. Vis. (2019), Springer, pp. 100--116.
[8]
Chen, S., Zhu, H., Chen, X., Lei, Y., Yu, G., and Chen, T. End-to-end 3d dense captioning with vote2cap-detr. In CVPR (2023), pp. 11124--11133.
[9]
Chen, Z., Gholami, A., Niessner, M., and Chang, A. X. Scan2cap: Contextaware dense captioning in rgb-d scans. In CVPR (2021), pp. 3193--3203.
[10]
Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., and Niessner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR (2017), pp. 5828--5839.
[11]
Han, M.,Wang, L., Xiao, L., Zhang, H., Zhang, C., Xu, X., and Zhu, J. Quickfps: Architecture and algorithm co-design for farthest point sampling in large-scale point clouds. IEEE Trans. Computer-Aided Des. Integr. Circuits and Syst. (2023).
[12]
Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. NeurIPS 33 (2020), 6840--6851.
[13]
Ho, J., and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022).
[14]
Höllein, L., Cao, A., Owens, A., Johnson, J., and Niessner, M. Text2room: Extracting textured 3d meshes from 2d text-to-image models. In ICCV (October 2023), pp. 7909--7920.
[15]
Huang, S., Chen, Y., Jia, J., and Wang, L. Multi-view transformer for 3d visual grounding. In CVPR (2022), pp. 15524--15533.
[16]
Huang,W., Liu, D., and Hu,W. Dense object grounding in 3d scenes. In Proceedings of the 31st ACM International Conference on Multimedia (2023), pp. 5017--5026.
[17]
Hui, L., Xu, R., Xie, J., Qian, J., and Yang, J. Progressive point cloud deconvolution generation network. In ECCV (2020), Springer, pp. 397--413.
[18]
Jiao, Y., Chen, S., Jie, Z., Chen, J., Ma, L., and Jiang, Y.-G. More: Multi-order relation mining for dense captioning in 3d scenes. In ECCV (2022), Springer, pp. 528--545.
[19]
Kamath, A., Anderson, P., Wang, S., Koh, J. Y., Ku, A., Waters, A., Yang, Y., Baldridge, J., and Parekh, Z. A new path: Scaling vision-and-language navigation with synthetic instructions and imitation learning. In CVPR (2023), pp. 10813--10823.
[20]
Kenton, J. D. M.-W. C., and Toutanova, L. K. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL (2019), pp. 4171--4186.
[21]
Kim, S., Lee, S., Hwang, D., Lee, J., Hwang, S. J., and Kim, H. J. Point cloud augmentation with weighted local transformations. In ICCV (2021), pp. 548--557.
[22]
Lim, S., Shin, M., and Paik, J. Point cloud generation using deep adversarial local features for augmented and mixed reality contents. IEEE Trans. Consum. Electron. 68, 1 (2022), 69--76.
[23]
Liu, Z., Wang, Y., Qi, X., and Fu, C.-W. Towards implicit text-guided 3d shape generation. In CVPR (2022), pp. 17896--17906.
[24]
Luo, S., and Hu, W. Diffusion probabilistic models for 3d point cloud generation. In CVPR (June 2021).
[25]
Melas-Kyriazi, L., Rupprecht, C., and Vedaldi, A. Pc2: Projection-conditioned point cloud diffusion for single-image 3d reconstruction. In CVPR (2023), pp. 12923--12932.
[26]
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99--106.
[27]
Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., and Chen, M. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751 (2022).
[28]
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[29]
Paschalidou, D., Kar, A., Shugrina, M., Kreis, K., Geiger, A., and Fidler, S. Atiss: Autoregressive transformers for indoor scene synthesis. NeurIPS 34 (2021), 12013--12026.
[30]
Qi, C. R., Yi, L., Su, H., and Guibas, L. J. Pointnet: Deep hierarchical feature learning on point sets in a metric space. NeurIPS 30 (2017).
[31]
Qian, G., Li, Y., Peng, H., Mai, J., Hammoud, H., Elhoseiny, M., and Ghanem, B. Pointnext: Revisiting pointnet with improved training and scaling strategies. In NeurIPS (2022).
[32]
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Int. Conf. Mach. Learn. (18--24 Jul 2021), M. Meila and T. Zhang, Eds., vol. 139 of Proceedings of Machine Learning Research, pp. 8748--8763.
[33]
Ren, X., and Wang, X. Look outside the room: Synthesizing a consistent longterm 3d scene video from a single image. In CVPR (2022), pp. 3563--3573.
[34]
Ren, Y., Zhao, S., and Bingbing, L. Object insertion based data augmentation for semantic segmentation. In Int. Conf. Robot. and Automat. (2022), IEEE, pp. 359-- 365.
[35]
Ritchie, D., Wang, K., and Lin, Y.-a. Fast and flexible indoor scene synthesis via deep convolutional generative models. In CVPR (2019), pp. 6182--6190.
[36]
Rubner, Y., Tomasi, C., and Guibas, L. J. The earth mover's distance as a metric for image retrieval. IJCV 40 (2000), 99--121.
[37]
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Int. Conf. Mach. Learn. (Lille, France, 07--09 Jul 2015), F. Bach and D. Blei, Eds., vol. 37 of Proceedings of Machine Learning Research, pp. 2256--2265.
[38]
Song, L., Cao, L., Xu, H., Kang, K., Tang, F., Yuan, J., and Zhao, Y. Roomdreamer: Text-driven 3d indoor scene synthesis with coherent geometry and texture. arXiv preprint arXiv:2305.11337 (2023).
[39]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In NeurIPS (2017), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30.
[40]
Wang, X., Yeshwanth, C., and Niessner, M. Sceneformer: Indoor scene generation with transformers. In 3DV (2021), IEEE, pp. 106--115.
[41]
Xu, J., Wang, X., Cheng, W., Cao, Y.-P., Shan, Y., Qie, X., and Gao, S. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. In CVPR (2023), pp. 20908--20918.
[42]
Xu, R., Hui, L., Han, Y., Qian, J., and Xie, J. Scene graph masked variational autoencoders for 3d scene generation. In Proceedings of the 31st ACM International Conference on Multimedia (2023), pp. 5725--5733.
[43]
Xu, R., Hui, L., Han, Y., Qian, J., and Xie, J. Transformer-based point cloud generation network. In Proceedings of the 31st ACM International Conference on Multimedia (2023), pp. 4169--4177.
[44]
Yang, G., Huang, X., Hao, Z., Liu, M.-Y., Belongie, S., and Hariharan, B. Pointflow: 3d point cloud generation with continuous normalizing flows. In ICCV (2019), pp. 4541--4550.
[45]
Yuan, Z., Yan, X., Liao, Y., Zhang, R., Li, Z., and Cui, S. Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring. In ICCV (2021), pp. 1791--1800.
[46]
Zhang, Y., Gong, Z., and Chang, A. X. Multi3drefer: Grounding text description to multiple 3d objects. In ICCV (2023), pp. 15225--15236.
[47]
Zhou, L., Du, Y., and Wu, J. 3d shape generation and completion through point-voxel diffusion. In CVPR (2021), pp. 5826--5835.
[48]
Zhou, Y., While, Z., and Kalogerakis, E. Scenegraphnet: Neural message passing for 3d indoor scene augmentation. In ICCV (2019), pp. 7384--7392.

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11,719 pages
    ISBN: 9798400706868
    DOI: 10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 October 2024

    Author Tags

    1. 3d point clouds
    2. deep learning
    3. generative model

    Qualifiers

    • Research-article

    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne, VIC, Australia

    Acceptance Rates

    MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
    Overall acceptance rate: 2,145 of 8,556 submissions (25%)
