Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps

Qiu, Dicong; Ma, Wenzong; Pan, Zhenfu; Xiong, Hui; Liang, Junwei

Computer Science > Robotics

arXiv:2406.18115 (cs)

[Submitted on 26 Jun 2024]

Abstract: Open-Vocabulary Mobile Manipulation (OVMM) is a crucial capability for autonomous robots, especially in unknown and dynamic environments. The task requires robots to explore and build a semantic understanding of their surroundings, generate feasible plans to achieve manipulation goals, adapt to environmental changes, and comprehend natural language instructions from humans. To address these challenges, we propose a novel framework that combines the zero-shot detection and grounded recognition capabilities of pretrained visual-language models (VLMs) with dense 3D entity reconstruction to build 3D semantic maps. Additionally, we use large language models (LLMs) for spatial region abstraction and online planning, incorporating human instructions and spatial semantic context. We have built a 10-DoF mobile manipulation robotic platform, JSR-1, and demonstrated in real-world robot experiments that the proposed framework can effectively capture spatial semantics and process natural language user instructions for zero-shot OVMM tasks in dynamic environments, with overall navigation and task success rates of 80.95% and 73.33%, respectively, over 105 episodes, and with SFT and SPL better than the baseline by 157.18% and 19.53%, respectively. Furthermore, when initial plans fail, the framework can replan towards the next most probable candidate location based on the spatial semantic context derived from the 3D semantic map, maintaining an average success rate of 76.67%.
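
The abstract reports SPL without defining it. Assuming it refers to the standard Success weighted by Path Length metric used in embodied navigation, the following minimal sketch shows how such a score is typically computed over a set of episodes. The `Episode` class, its field names, and the `spl` function are hypothetical names chosen for illustration; they are not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical record, not from the paper)."""
    success: bool            # whether the robot reached the goal
    shortest_path_m: float   # shortest-path length from start to goal, in metres
    actual_path_m: float     # length of the path the robot actually travelled


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    Each successful episode contributes shortest / max(actual, shortest);
    failed episodes contribute 0.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if not ep.success:
            continue
        denom = max(ep.actual_path_m, ep.shortest_path_m)
        # A zero-length successful episode is treated as perfectly efficient.
        total += 1.0 if denom == 0 else ep.shortest_path_m / denom
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path_m=4.0, actual_path_m=4.0),   # optimal path
    Episode(success=True, shortest_path_m=4.0, actual_path_m=8.0),   # twice as long
    Episode(success=False, shortest_path_m=4.0, actual_path_m=10.0), # failure
]
score = spl(episodes)
```

Under this definition, a method can raise its SPL either by succeeding more often or by taking shorter paths on the episodes where it succeeds.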

Comments:	Open-vocabulary, Mobile Manipulation, Dynamic Environments, 3D Semantic Maps, Zero-shot, LLMs, VLMs, 18 pages, 2 figures
Subjects:	Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.18115 [cs.RO]
	(or arXiv:2406.18115v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2406.18115

Submission history

From: Dicong Qiu
[v1] Wed, 26 Jun 2024 07:06:42 UTC (5,240 KB)
