DOI: 10.1145/3664647.3681248
research-article

Embodied Contrastive Learning with Geometric Consistency and Behavioral Awareness for Object Navigation

Published: 28 October 2024

Abstract

Object Navigation (ObjectNav), which requires an agent to find any instance of an object category specified by a semantic label, has advanced considerably. However, current agents rely on occlusion-prone visual observations or compressed 2D semantic maps, which hinder their embodied perception of 3D scene geometry and easily lead to ambiguous object localization and blind exploration. To address these limitations, we present an Embodied Contrastive Learning (ECL) method with Geometric Consistency (GC) and Behavioral Awareness (BA), which motivates agents to actively encode 3D scene layouts and semantic cues. Driven by our embodied exploration strategy, BA is modeled by predicting navigational actions from multi-frame visual images, because the behaviors that cause differences between adjacent visual sensations are crucial for learning correlations among consecutive views. GC is modeled as the alignment of behavior-aware visual stimuli with 3D semantic shapes through unsupervised contrastive learning. The aligned behavior-aware visual features and geometric invariance priors are injected into a modular ObjectNav framework to enhance object recognition and exploration capabilities. As expected, our ECL method performs well on object detection and instance segmentation tasks. Our ObjectNav strategy outperforms state-of-the-art methods on the MP3D and Gibson datasets, showing the potential of ECL for embodied navigation.
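
The abstract describes GC as aligning visual features with 3D semantic shapes through contrastive learning but does not state the loss. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style objective over paired embeddings; the function name `info_nce`, the temperature value, and the use of plain Python lists are all hypothetical choices and are not taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def info_nce(visual, shape, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed stand-in for the paper's GC objective).

    visual[i] and shape[i] form a positive pair; every other pairing in the
    batch is treated as a negative.
    """
    v = [_normalize(x) for x in visual]
    s = [_normalize(x) for x in shape]
    logits = [
        [sum(a * b for a, b in zip(vi, sj)) / temperature for sj in s]
        for vi in v
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the visual-to-shape and shape-to-visual directions.
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


# Hypothetical toy embeddings: matched pairs should score a lower loss
# than deliberately mismatched ones.
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(visual_feats, visual_feats)
swapped_loss = info_nce(visual_feats, visual_feats[::-1])
```

Under this assumption, minimizing the loss pulls each visual embedding toward its paired 3D-shape embedding and pushes it away from the other shapes in the batch.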

      Published In

      MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
      October 2024
      11719 pages
      ISBN:9798400706868
      DOI:10.1145/3664647
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. behavioral awareness
      2. contrastive representation learning
      3. embodied ai
      4. geometric consistency
      5. object navigation

      Qualifiers

      • Research-article

      Conference

      MM '24
      MM '24: The 32nd ACM International Conference on Multimedia
      October 28 - November 1, 2024
      Melbourne VIC, Australia

      Acceptance Rates

      MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

