DOI: 10.1145/3664647.3681699
Research Article

Context-Aware Indoor Point Cloud Object Generation through User Instructions

Published: 28 October 2024

Abstract

Indoor scene modification has emerged as a prominent area within computer vision, particularly for its applications in Augmented Reality (AR) and Virtual Reality (VR). Traditional methods often rely on pre-existing object databases and predetermined object positions, limiting their flexibility and adaptability to new scenarios. In response to this challenge, we present a novel end-to-end multi-modal deep neural network that generates point cloud objects seamlessly integrated with their surroundings, driven by textual instructions. Our approach enables the creation of new environments with previously unseen object layouts, eliminating the need for pre-stored CAD models. Leveraging Point-E as our generative model, we introduce quantized position prediction and Top-K estimation to address false negatives arising from ambiguous language descriptions. Furthermore, we conduct comprehensive evaluations covering the diversity of generated objects, the efficacy of textual instructions, and quantitative metrics, affirming the realism and versatility of our model in generating indoor objects. To provide a holistic assessment, we incorporate visual grounding as an additional metric, ensuring the quality and coherence of the scenes produced by our model. Through these advancements, our approach not only advances the state of the art in indoor scene modification but also lays the foundation for future innovations in immersive computing and digital environment creation. The project is available at https://ainnovatelab.github.io/Context-aware-Indoor-PCG.
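The abstract names quantized position prediction and Top-K estimation but gives no implementation details. The following is a minimal illustrative sketch, not the authors' code: it assumes per-axis discretization of room coordinates into bins, a fused text-plus-scene feature vector as input, and a loss that skips examples whose ground-truth bin already falls inside the model's top-k guesses (one plausible reading of mitigating false negatives from ambiguous instructions). All module names, bin counts, and the room extent are hypothetical.

```python
# Illustrative sketch only; design choices here are assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedPositionHead(nn.Module):
    """Predicts an object's (x, y, z) position as three categorical
    distributions over discretized room coordinates (one classifier per axis)."""

    def __init__(self, feat_dim=512, num_bins=64, room_extent=8.0):
        super().__init__()
        self.num_bins = num_bins
        self.room_extent = room_extent              # assumed room size in meters
        self.bin_size = room_extent / num_bins
        self.heads = nn.ModuleList([nn.Linear(feat_dim, num_bins) for _ in range(3)])

    def forward(self, fused_feat):
        # fused_feat: (B, feat_dim) joint embedding of instruction and scene
        return torch.stack([h(fused_feat) for h in self.heads], dim=1)  # (B, 3, num_bins)

    def top_k_positions(self, logits, k=5):
        """Return the k most probable bin centers per axis, so several plausible
        placements of an ambiguous instruction are all retained."""
        probs = F.softmax(logits, dim=-1)                   # (B, 3, num_bins)
        top_p, top_idx = probs.topk(k, dim=-1)              # (B, 3, k)
        centers = (top_idx.float() + 0.5) * self.bin_size   # bin index -> coordinate
        return centers, top_p

def top_k_position_loss(logits, target_xyz, bin_size, k=5):
    """Cross-entropy against the ground-truth bin, zero-weighted whenever the true
    bin is already within the top-k predictions (softens ambiguous-language cases)."""
    target_bins = (target_xyz / bin_size).long().clamp(0, logits.size(-1) - 1)   # (B, 3)
    ce = F.cross_entropy(logits.flatten(0, 1), target_bins.flatten(), reduction="none")
    in_top_k = (logits.topk(k, dim=-1).indices == target_bins.unsqueeze(-1)).any(-1)
    return (ce * (~in_top_k.flatten()).float()).mean()

# Usage with random features standing in for a real text+scene encoder.
head = QuantizedPositionHead()
feats = torch.randn(4, 512)
logits = head(feats)
loss = top_k_position_loss(logits, target_xyz=torch.rand(4, 3) * 8.0, bin_size=head.bin_size)
positions, scores = head.top_k_positions(logits, k=5)
```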

References

[1]
Achlioptas, P., Abdelreheem, A., Xia, F., Elhoseiny, M., and Guibas, L. J. ReferIt3D: Neural listeners for fine-grained 3d object identification in real-world scenes. In ECCV (2020).
[2]
Achlioptas, P., Diamanti, O., Mitliagkas, I., and Guibas, L. Learning representations and generative models for 3D point clouds. In Int. Conf. Mach. Learn. (10--15 Jul 2018), J. Dy and A. Krause, Eds., vol. 80 of Proceedings of Machine Learning Research, pp. 40--49.
[3]
Bautista, M. A., Guo, P., Abnar, S., Talbott,W., Toshev, A., Chen, Z., Dinh, L., Zhai, S., Goh, H., Ulbricht, D., et al. Gaudi: A neural architect for immersive 3d scene generation. NeurIPS 35 (2022), 25102--25116.
[4]
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In NeurIPS (2020), vol. 33, pp. 1877--1901.
[5]
Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015).
[6]
Chen, D. Z., Chang, A. X., and Niessner, M. Scanrefer: 3d object localization in rgb-d scans using natural language. ECCV (2020).
[7]
Chen, K., Choy, C. B., Savva, M., Chang, A. X., Funkhouser, T., and Savarese, S. Text2shape: Generating shapes from natural language by learning joint embeddings. In Asian Conf. Comput. Vis. (2019), Springer, pp. 100--116.
[8]
Chen, S., Zhu, H., Chen, X., Lei, Y., Yu, G., and Chen, T. End-to-end 3d dense captioning with vote2cap-detr. In CVPR (2023), pp. 11124--11133.
[9]
Chen, Z., Gholami, A., Niessner, M., and Chang, A. X. Scan2cap: Contextaware dense captioning in rgb-d scans. In CVPR (2021), pp. 3193--3203.
[10]
Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., and Niessner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR (2017), pp. 5828--5839.
[11]
Han, M.,Wang, L., Xiao, L., Zhang, H., Zhang, C., Xu, X., and Zhu, J. Quickfps: Architecture and algorithm co-design for farthest point sampling in large-scale point clouds. IEEE Trans. Computer-Aided Des. Integr. Circuits and Syst. (2023).
[12]
Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. NeurIPS 33 (2020), 6840--6851.
[13]
Ho, J., and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022).
[14]
Höllein, L., Cao, A., Owens, A., Johnson, J., and Niessner, M. Text2room: Extracting textured 3d meshes from 2d text-to-image models. In ICCV (October 2023), pp. 7909--7920.
[15]
Huang, S., Chen, Y., Jia, J., and Wang, L. Multi-view transformer for 3d visual grounding. In CVPR (2022), pp. 15524--15533.
[16]
Huang,W., Liu, D., and Hu,W. Dense object grounding in 3d scenes. In Proceedings of the 31st ACM International Conference on Multimedia (2023), pp. 5017--5026.
[17]
Hui, L., Xu, R., Xie, J., Qian, J., and Yang, J. Progressive point cloud deconvolution generation network. In ECCV (2020), Springer, pp. 397--413.
[18]
Jiao, Y., Chen, S., Jie, Z., Chen, J., Ma, L., and Jiang, Y.-G. More: Multi-order relation mining for dense captioning in 3d scenes. In ECCV (2022), Springer, pp. 528--545.
[19]
Kamath, A., Anderson, P., Wang, S., Koh, J. Y., Ku, A., Waters, A., Yang, Y., Baldridge, J., and Parekh, Z. A new path: Scaling vision-and-language navigation with synthetic instructions and imitation learning. In CVPR (2023), pp. 10813--10823.
[20]
Kenton, J. D. M.-W. C., and Toutanova, L. K. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL (2019), pp. 4171--4186.
[21]
Kim, S., Lee, S., Hwang, D., Lee, J., Hwang, S. J., and Kim, H. J. Point cloud augmentation with weighted local transformations. In ICCV (2021), pp. 548--557.
[22]
Lim, S., Shin, M., and Paik, J. Point cloud generation using deep adversarial local features for augmented and mixed reality contents. IEEE Trans. Consum. Electron. 68, 1 (2022), 69--76.
[23]
Liu, Z., Wang, Y., Qi, X., and Fu, C.-W. Towards implicit text-guided 3d shape generation. In CVPR (2022), pp. 17896--17906.
[24]
Luo, S., and Hu, W. Diffusion probabilistic models for 3d point cloud generation. In CVPR (June 2021).
[25]
Melas-Kyriazi, L., Rupprecht, C., and Vedaldi, A. Pc2: Projection-conditioned point cloud diffusion for single-image 3d reconstruction. In CVPR (2023), pp. 12923--12932.
[26]
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99--106.
[27]
Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., and Chen, M. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751 (2022).
[28]
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[29]
Paschalidou, D., Kar, A., Shugrina, M., Kreis, K., Geiger, A., and Fidler, S. Atiss: Autoregressive transformers for indoor scene synthesis. NeurIPS 34 (2021), 12013--12026.
[30]
Qi, C. R., Yi, L., Su, H., and Guibas, L. J. Pointnet: Deep hierarchical feature learning on point sets in a metric space. NeurIPS 30 (2017).
[31]
Qian, G., Li, Y., Peng, H., Mai, J., Hammoud, H., Elhoseiny, M., and Ghanem, B. Pointnext: Revisiting pointnet with improved training and scaling strategies. In NeurIPS (2022).
[32]
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Int. Conf. Mach. Learn. (18--24 Jul 2021), M. Meila and T. Zhang, Eds., vol. 139 of Proceedings of Machine Learning Research, pp. 8748--8763.
[33]
Ren, X., and Wang, X. Look outside the room: Synthesizing a consistent longterm 3d scene video from a single image. In CVPR (2022), pp. 3563--3573.
[34]
Ren, Y., Zhao, S., and Bingbing, L. Object insertion based data augmentation for semantic segmentation. In Int. Conf. Robot. and Automat. (2022), IEEE, pp. 359-- 365.
[35]
Ritchie, D., Wang, K., and Lin, Y.-a. Fast and flexible indoor scene synthesis via deep convolutional generative models. In CVPR (2019), pp. 6182--6190.
[36]
Rubner, Y., Tomasi, C., and Guibas, L. J. The earth mover's distance as a metric for image retrieval. IJCV 40 (2000), 99--121.
[37]
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Int. Conf. Mach. Learn. (Lille, France, 07--09 Jul 2015), F. Bach and D. Blei, Eds., vol. 37 of Proceedings of Machine Learning Research, pp. 2256--2265.
[38]
Song, L., Cao, L., Xu, H., Kang, K., Tang, F., Yuan, J., and Zhao, Y. Roomdreamer: Text-driven 3d indoor scene synthesis with coherent geometry and texture. arXiv preprint arXiv:2305.11337 (2023).
[39]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In NeurIPS (2017), I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30.
[40]
Wang, X., Yeshwanth, C., and Niessner, M. Sceneformer: Indoor scene generation with transformers. In 3DV (2021), IEEE, pp. 106--115.
[41]
Xu, J., Wang, X., Cheng, W., Cao, Y.-P., Shan, Y., Qie, X., and Gao, S. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. In CVPR (2023), pp. 20908--20918.
[42]
Xu, R., Hui, L., Han, Y., Qian, J., and Xie, J. Scene graph masked variational autoencoders for 3d scene generation. In Proceedings of the 31st ACM International Conference on Multimedia (2023), pp. 5725--5733.
[43]
Xu, R., Hui, L., Han, Y., Qian, J., and Xie, J. Transformer-based point cloud generation network. In Proceedings of the 31st ACM International Conference on Multimedia (2023), pp. 4169--4177.
[44]
Yang, G., Huang, X., Hao, Z., Liu, M.-Y., Belongie, S., and Hariharan, B. Pointflow: 3d point cloud generation with continuous normalizing flows. In ICCV (2019), pp. 4541--4550.
[45]
Yuan, Z., Yan, X., Liao, Y., Zhang, R., Li, Z., and Cui, S. Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring. In ICCV (2021), pp. 1791--1800.
[46]
Zhang, Y., Gong, Z., and Chang, A. X. Multi3drefer: Grounding text description to multiple 3d objects. In ICCV (2023), pp. 15225--15236.
[47]
Zhou, L., Du, Y., and Wu, J. 3d shape generation and completion through point-voxel diffusion. In CVPR (2021), pp. 5826--5835.
[48]
Zhou, Y., While, Z., and Kalogerakis, E. Scenegraphnet: Neural message passing for 3d indoor scene augmentation. In ICCV (2019), pp. 7384--7392.

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11,719 pages
    ISBN: 9798400706868
    DOI: 10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 October 2024

    Author Tags

    1. 3d point clouds
    2. deep learning
    3. generative model

    Qualifiers

    • Research-article

    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne, VIC, Australia

    Acceptance Rates

    MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
    Overall acceptance rate: 2,145 of 8,556 submissions (25%)
