Fig. 1

Motivation of the visual navigation framework. The robot explores the scene and applies large language models to find the most relevant graph node (shown as a circle) based on the observation and the target object

1 Introduction

With the rapid development of computer vision technology, the expansion of the industry, and the continuous optimization and integration of computing resources for large-scale deep learning frameworks, applications such as intelligent home service robots and indoor visual navigation have emerged [12, 14, 15, 18, 19, 21, 26, 29, 30, 44]. These agents accomplish their respective functions within their environments through autonomous navigation, which comprises localization, map construction, path planning, and motion control. Visual navigation methods encompass all of these processes.

An agent, functioning as an entity that perceives and acts within its environment, makes decisions on behalf of users or programs with the aid of a planner. Possessing a degree of autonomy, agents execute specific, predictable, and repetitive tasks for users or applications. They encompass virtual entities such as game characters [39] and online services [6], as well as real-world counterparts like warehouse robots [32], underwater robots [42], and servo robots [17]. Navigation is a critical technology in the interaction between agents and their surroundings. Furthermore, large language models are currently a prominent focus of research because they can equip robots with prior knowledge, supervise the training of robot systems, and reduce training costs. This is illustrated in Fig. 1: when searching for the target object "bed", the agent queries the LLM to find the most relevant graph node (shown as a circle) based on the observation and the target object.

In recent years, the emergence of large language models (LLMs) has provided new possibilities for embedding rich semantic knowledge into robots [24, 50]. These models can process vast amounts of natural language text and extract abundant semantic information, which can be used to guide robot behavior. By integrating with LLMs, robots can better understand human instructions and intentions and make better-informed decisions when performing tasks.

Zhu [52] proposed the first visual navigation framework, laying the foundation for basic visual navigation models. It stores target location relationships as spatial structural information of the environment, providing spatial and visual relations for known targets in unknown scenes and thereby improving generalization across scenes. The framework first extracts environmental observations using pre-trained siamese (twin) networks and trains each scene separately, using these observations as state inputs to form a mapping from the environment to actions. It set the precedent for indoor mapless visual navigation.

Qiu [9] proposed a knowledge graph constructed from "parent target–sub-target" relations, where the spatial distance between sub-targets and the target is used as the reward value, encouraging the agent to explore different parent targets in the room until the target is found. Obin [20] proposed a Visual Graph Memory (VGM) composed of unsupervised image representations. Using VGM, the agent can embed its navigation history and other useful task-related information, thus improving navigation generalization across scenes (Fig. 2).

Learning-based visual navigation methods often require substantial computational resources to acquire and apply scene priors. For example, reinforcement learning may require millions of steps to converge, leading to high training costs. This prompts us to search for alternative ways of acquiring scene priors that do not rely on reinforcement learning. Recent research has explored interaction-free learning and imitation learning to address this issue, yielding significant reductions in training cost.

To address the challenges of effective environment representation and path planning, we devise a path planning policy supervised by prior knowledge. Leveraging large language models for multi-modal information processing, we establish a knowledge graph-based clustering control policy that assists the agent in localization and path selection. By continuously measuring the semantic distance between the current cluster and the target, we update the positional information of the global graph and the agent's own state in real time during navigation. This model reduces training errors and can adapt to changes across environments. The policy groups semantically close objects into a single node (cluster) of the knowledge graph (e.g., remote control, TV, and sofa form one node). During updates, Graph Attention Networks (GAT) are employed to obtain useful structural information from the environment. The graph attention layer learns the MMSG by recursively propagating graph embedding features from neighboring nodes to refine node embeddings, and applies attention to weigh the importance of target neighbors in the observation. It emphasizes target-relevant features while suppressing irrelevant features in the MMSG, making the extracted local features more discriminative and facilitating target localization. The proposed method demonstrates good generalization of the MMSG in ProcTHOR. In navigation performance tests, the scene-adaptive MMSG helps navigation agents find targets faster, with higher data efficiency and success rates. Experiments show that the MMSG modules achieve state-of-the-art results in the ProcTHOR environment. Our key contributions are summarized as follows:

  • We design a large-scale multi-modal knowledge graph dataset of scenes to train the task planner from pre-trained LLMs, constructing a multi-modal dataset containing 120 indoor scenes, 35,000 instructions, and corresponding action plans.

  • We introduce two paradigms, (i) a Meta-Learning approach and (ii) a Graph Attention Network approach, into our visual navigation framework, enabling path planning without extra expensive training.

  • In our benchmark evaluation, we assessed various LLMs and LMMs for complex embodied task planning. Additionally, we conducted an ablation study to determine the optimal representation of visual scenes for generating executable actions.

2 Related work

2.1 Visual navigation

In visual navigation tasks, an agent performs actions based on visual observations to reach a given target object [11, 28, 41, 45]. Initially, the visual navigation model TD-A3C [52] proposed a method that uses ResNet to extract visual features from the environment images and map them to navigation actions. Subsequently, Yang et al. [46] guided the agent’s exploration in unknown environments by pre-building a scene knowledge graph. Zhang et al. [49] studied the parent–child relationship between large and small objects in the environment using HOZ to improve the agent’s ability in ambiguous search. Dang et al. [8] proposed an unbiased directional scene graph to eliminate the phenomenon of modality collapse between different modalities. However, these models struggle to dynamically update prior knowledge or to use prior knowledge efficiently. Although they use pre-established maps to improve the robot’s localization ability, they require constructing a corresponding scene graph for each scene, which necessitates large-scale training to achieve scene generalization. In our model, target images and contextual semantics are fused, and large-scale prior knowledge is used to dynamically update the scene graph, giving the MMSG cross-scene generalization ability.

2.2 Large language models for visual navigation

In recent years, several works have focused on utilizing existing large-scale models to assist visual navigation. LM-Nav [37] utilizes an open-vocabulary detector to extract landmarks from the navigation environment. These landmarks are then passed to a vision-language model for localization and planning. L3MVN [48] proposes a method that uses a semantic segmentation model to calculate the entropy of objects in each frontier. This entropy is represented as query strings, and a language model is used to determine a more relevant frontier. ZSON [27] identifies objects from open-vocabulary categories and encodes them in a semantic embedding space. Zhou et al. [50] design an exploration method with soft commonsense constraints (ESC) to restrict the behavior of the agent. These methods leverage common sense knowledge from pre-trained models, eliminating the need for additional priors or training. They formulate path planning to execute actions within predefined intervals but may result in suboptimal positions due to limited data. Additionally, they lack the ability to adjust plans during the execution of navigation behavior. Therefore, agents need to interact and incrementally generate and update foundational plans in unknown environments. These approaches heavily rely on prompt engineering for LLMs and do not fine-tune the LLMs. In contrast, our method does not require extensive prompt engineering and directly fine-tunes LLMs for the visual navigation policy.

2.3 Multi-modal large language models

Utilizing language models for detecting target categories in visual navigation requires the agent to understand modalities beyond text. In this context, we discuss approaches that fine-tune language models with image-text pairs to enhance their visual capabilities. DeepSeek-VL [25] proposes a vision-language scene understanding model designed for real-world vision and language understanding applications. MiniGPT4 [51] fine-tunes the pre-trained LLaMA [40] model using curated image-text pairs. It utilizes the visual encoder and Q-Former from BLIP2 [22], adds a trainable linear layer to transform visual features into visual tokens, inserts the visual tokens and the text tokens from the text prompt into LLaMA [40], and trains the model. InstructBlip [7] extends the idea of MiniGPT4 by training language models with high-quality image-text pairs. InstructBlip collects 26 publicly available datasets covering various tasks and capabilities and converts them into an instruction tuning format for fine-tuning language models. However, these approaches rely heavily on text information from LLM priors and ignore image features that could provide object relationships and coordinates, which may hinder the agent’s path planning ability. Similar to MiniGPT4 and InstructBlip, our method involves creating correspondences between agent observations and actions, which we use to fine-tune language models. We consider observations from visual images and text. Text is used to represent actions, such as "turn left" or "move forward."

3 Proposed visual navigation method

In this section, we first introduce the process of constructing the multi-modal knowledge graph dataset. Following that, we describe the process of utilizing large language models to select nodes within this graph. Finally, we provide detailed explanations of associating specific task plans with visual scenes through image collection and object detection, as well as selecting navigation paths.

3.1 Multi-modal knowledge graph generation

In the visual navigation task, the agent cannot access prior knowledge about the environment (such as topological maps and 3D grid maps) or use additional sensors (such as depth cameras). The agent’s only source of data is an RGB image from an egocentric perspective, and it predicts its next action based on the current view and previous states. Given a target object class, such as "toaster," the visual navigation task requires the agent to navigate until it observes an instance of this class. Given an embodied 3D scene \(X_s\), we adopt the class names of all objects as the scene representation, denoted as \(X_l\), where \(X_l = [table, chair, keyboard,...]\). A common approach to generating multi-modal instructions for embodied task plans, as seen in the ALFRED benchmark [38], is to manually craft a series of instructions with corresponding step-by-step actions. However, this manual approach incurs significant annotation costs when generating complex task plans suitable for practical service robots, such as tidying up bathrooms and making sandwiches.

To efficiently generate large-scale, complex instructions \(X_q\) and corresponding executable plans \(X_a\) for a given 3D scene, we design a prompt to simulate scenarios of embodied task planning for GPT-3.5 to automatically synthesize data based on the object name list \(X_l\). As detailed in Table 5 of the supplementary materials, our prompt outlines the definition of embodied task planning, requirements, and several examples of generated instructions and corresponding action plans. Specifically, the prompt simulates a conversation between the service robot and humans to generate executable instructions and actions, mimicking robots’ exploration in embodied environments and addressing humans’ needs.
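As a hedged illustration (not the authors' exact pipeline), the sketch below shows how instruction/plan pairs could be synthesized from the object list \(X_l\) with the OpenAI chat API; the prompt wording, the helper name synthesize_task_plans, and the sampling settings are assumptions.

```python
# Minimal sketch of prompting GPT-3.5 to synthesize instruction/plan pairs
# from a scene object list X_l. Prompt text and helper names are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthesize_task_plans(object_list, num_samples=5):
    """Ask the LLM to generate instructions X_q and executable plans X_a for X_l."""
    system_prompt = (
        "You are a household service robot. Given a list of objects present in a "
        "3D indoor scene, generate user instructions and step-by-step executable "
        "action plans that only reference objects from the list."
    )
    user_prompt = (
        f"Objects in the scene: {', '.join(object_list)}.\n"
        f"Generate {num_samples} instruction/plan pairs."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Example: X_l for a kitchen scene
print(synthesize_task_plans(["table", "chair", "toaster", "mug", "sink"]))
```

Only pairs whose actions reference objects in \(X_l\) would then be kept, mirroring the filtering described below.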

Fig. 2

Overview of our MMSG agent using LLMs. First, the agent obtains the target Region of Interest (ROI), positional encoding (PE), bounding box (bbox), and target category via multi-view object detection, along with the cosine similarity (CS) with the target object. Subsequently, it fuses node information with the target word vector using CLIP fusion, which is input to the GAT to obtain the environmental feature representation. This feature is compared with the features in the Alpaca model, and based on the user’s query and the prompt template, a task plan is generated to complete the task. Throughout this process, MMSG serves as grounding supervision to reduce hallucination errors

The generated instructions encompass a variety of requests, commands, and queries, with only instructions containing explicitly executable actions added to our dataset. Additionally, we stress that the target object of the generated action should be confined within the object list \(X_l\) to mitigate object hallucination leading to inexecutable plans. For the object list used in the prompt for dataset generation, we directly employ the ground truth label of existing instances in the scene.

Fig. 3

Example of the generated multi-modal knowledge graph-based visual scenes, instructions and plans

In Fig. 3, we provide examples of the generated samples containing the object name list of the scene, instructions, and executable action steps. In embodied task planning, the agent can only access the visual scene containing all interactive objects without the ground truth object list. Therefore, we construct the multi-modal dataset by defining triplets for each sample as \(X = (X_v, X_q, X_a)\). During the training phase of the task planner, we leverage the ground truth object list for each scene to mitigate the influence of inaccurate visual perception. During the inference phase, the extended DETR [11] object detector predicts the list of all existing objects in the scene. At each time step t, the agent obtains the observation \(o_{t}\) from its monocular camera, as well as the target object \(c\in {C}\). Given the observation \(o_{t}\) and the target object c, the agent utilizes the visual navigation network to generate the policy \(\pi (a_{t}|o_{t},c)\), where \(a_{t}\) represents the action distribution at time t, and the agent selects the action with the highest probability for navigation (Figs. 4, 5, 6).

During the navigation process, the agent moves at time t and obtains the observation \(o_{t}\) of the current state \(s_{t}\). By executing an action \(a_{t}\sim \pi (o_{t})\) given observation \(o_{t}\), the agent interacts with the environment, obtains the reward \(r_{t}\) for that action, and transitions to the new state \(s_{t+1}=\nu (s_{t}, a_{t})\). The goal of the agent in each round is to maximize the cumulative reward \(R=\varSigma _{t=0}^{T}\gamma ^{t}r_{t}\). A is the action space, consisting of a set of discrete actions. In the navigation environment, the agent moves using six different actions, \(A=\{MoveAhead, RotateLeft, RotateRight, LookUp, LookDown, Done\}\). Specifically, the MoveAhead step is 0.25 m, the rotation angle for RotateLeft and RotateRight is \(45^{\circ }\), and the tilt angle for LookUp and LookDown is \(30^{\circ }\). A round is defined as successful if it meets the following three criteria simultaneously: (1) the agent issues the end action Done within the allowed number of steps; (2) the target object is within the agent’s field of view; (3) the distance between the agent and the target is less than 1.5 m. Otherwise, the round is considered a failure.
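To make this setup concrete, the following minimal sketch (with assumed env and policy interfaces) runs one episode, applies the three success criteria, and computes the discounted return \(R=\varSigma _{t=0}^{T}\gamma ^{t}r_{t}\); the discount value and the info fields are assumptions.

```python
# Minimal sketch of the episode bookkeeping described above (0.25 m step,
# 45°/30° rotations, <= 100 steps, 1.5 m success radius). `env` and `policy`
# are illustrative placeholders, not the authors' implementation.
import numpy as np

ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight", "LookUp", "LookDown", "Done"]
GAMMA = 0.99  # discount factor (assumed value)

def run_episode(env, policy, target, max_steps=100, success_dist=1.5):
    rewards, done_called, target_visible, dist = [], False, False, np.inf
    obs = env.reset()
    for t in range(max_steps):
        probs = policy(obs, target)              # pi(a_t | o_t, c)
        action = ACTIONS[int(np.argmax(probs))]  # pick the most probable action
        obs, reward, info = env.step(action)     # s_{t+1} = nu(s_t, a_t)
        rewards.append(reward)
        if action == "Done":
            done_called = True
            target_visible = info["target_in_view"]
            dist = info["distance_to_target"]
            break
    # The round succeeds only if all three criteria hold simultaneously.
    success = done_called and target_visible and dist < success_dist
    cumulative = sum(GAMMA ** t * r for t, r in enumerate(rewards))  # R = sum gamma^t r_t
    return success, cumulative
```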

We employ DETR as our detector to obtain local image position embeddings, regions of interest (ROIs), category labels, and the cosine similarity with the target object. These features are concatenated into a node representing the detected objects. Subsequently, a triplet (object, relationship, object) is formed based on co-occurrence relationships. Room-level region triplets: similar scenes (e.g., bedrooms) often contain common objects and layouts. For instance, when mentioning a living room, one might envision an area consisting of a sofa, pillows, and a table, or an area with a TV and a TV stand. When searching for targets, humans tend to first locate the typical areas where the targets are most likely to appear. In a specific room, the agent first explores the room randomly, observing a set of visual tuple features \((f, l)\), where \(f\in R^{(N\times {1})}\) is the target bag-of-words vector obtained by DETR, indicating the targets present in the current view, with each entry being 0 or 1 for the corresponding category.

Fig. 4

Framework of DETR object detection in our pipeline. It depicts the PE, ROI, bbox, and class information

If the current view contains several targets of the same category, only one record is made in the room-region triplet. Here, N is the number of target object categories, and \(l=\{x,z,\theta _{yaw},\theta _{pitch}\}\) denotes the observation position, where x and z are horizontal coordinates and \(\theta _{yaw}\) and \(\theta _{pitch}\) are the agent’s yaw and pitch angles. K-means clustering is then applied to the features f to obtain K regions, forming a room-level MMSG graph \(\varOmega _i (V_i,E_i)\). Scene-level zone triplets are built to infer inherent zones. To obtain a scene-oriented MMSG graph, all room-oriented MMSG graphs are grouped by scene category. Taking one scene as an example, the room-level region set is \(\varOmega =\{\varOmega _1 (V_1,E_1 ),...,\varOmega _n (V_n,E_n )\}\). Since the number of regions K is fixed, each room’s MMSG graph has the same structure for matching and merging. Using these triplets, we construct a multi-modal knowledge graph. Graph Attention Networks (GAT) are then employed to extract valuable graph features and adapt to changes in the scene layout.
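As an illustration of the room-level clustering step, the sketch below (with assumed helper names, using scikit-learn's K-means) clusters per-view bag-of-words vectors f into K regions that become MMSG zone nodes.

```python
# Minimal sketch of forming room-level MMSG zones: each exploration view yields
# a binary bag-of-words vector f over N object categories plus a pose
# l = (x, z, yaw, pitch); K-means over f gives K regions.
import numpy as np
from sklearn.cluster import KMeans

def build_room_zones(detections, poses, num_categories, k=4):
    """detections: list of per-view category-index lists; poses: list of (x, z, yaw, pitch)."""
    # Binary bag-of-words feature f in {0, 1}^N per view (one record per category).
    F = np.zeros((len(detections), num_categories), dtype=np.float32)
    for i, cats in enumerate(detections):
        F[i, list(set(cats))] = 1.0
    # Cluster views into K regions; each cluster becomes one MMSG node (zone).
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(F)
    zones = []
    for z in range(k):
        idx = np.where(labels == z)[0]
        zones.append({
            "categories": np.where(F[idx].max(axis=0) > 0)[0].tolist(),  # objects seen in zone
            "poses": [poses[i] for i in idx],                            # observation positions l
        })
    return zones
```

Because K is fixed, the resulting room-level graphs share the same structure and can be matched and merged into the scene-level MMSG as described above.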

3.2 Grounding text prompt

We generate a set of text prompts by paraphrasing them using ChatGPT. Here is an example of a text prompt: “Picture yourself as a robot, navigating to locate \(\langle\) Goal \(\rangle\) \(\langle\) GoalHere \(\rangle\) \(\langle\) / Goal \(\rangle\) . Given the current observation \(\langle\) Img \(\rangle\) \(\langle\) ImageHere \(\rangle\) \(\langle\) / Img \(\rangle\) , and suggested action probabilities \(\langle\) ActionProb \(\rangle\) \(\langle\) ActionProbHere \(\rangle\) \(\langle\) / ActionProb \(\rangle\) , please plan out your following action.” In this text prompt, \(\langle\) GoalHere \(\rangle\) represents the object category and \(\langle\) ImageHere \(\rangle\) represents the observation tokens. The action probabilities are provided as text, for example: “Stop with probability 0.03, move forward with probability 0.44, turn left with probability 0.28, turn right with probability 0.21, look up with probability 0.03, and look down with probability 0.01.”
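A minimal sketch of assembling this grounding prompt is shown below; the template wording follows the example above, while the helper names are assumptions.

```python
# Minimal sketch of building the grounding text prompt; tag names follow the example.
ACTION_NAMES = ["Stop", "move forward", "turn left", "turn right", "look up", "look down"]

PROMPT_TEMPLATE = (
    "Picture yourself as a robot, navigating to locate <Goal> {goal} </Goal>. "
    "Given the current observation <Img> {image_tokens} </Img>, and suggested action "
    "probabilities <ActionProb> {action_text} </ActionProb>, please plan out your following action."
)

def format_action_probs(probs):
    """Turn an action distribution into the natural-language form used in the prompt."""
    parts = [f"{name} with probability {p:.2f}" for name, p in zip(ACTION_NAMES, probs)]
    return ", ".join(parts[:-1]) + ", and " + parts[-1]

def build_prompt(goal, image_tokens, probs):
    return PROMPT_TEMPLATE.format(goal=goal,
                                  image_tokens=image_tokens,
                                  action_text=format_action_probs(probs))

# Example
print(build_prompt("bed", "<obs_tokens>", [0.03, 0.44, 0.28, 0.21, 0.03, 0.01]))
```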

3.3 Evaluation metrics

The most common evaluation metrics for navigation performance are Success Rate (SR) and Success Weighted by Path Length (SPL), proposed by Anderson et al. [1]. Success Rate (SR) is defined as the ratio of the number of times the agent successfully navigates to the target to the total number of rounds. Success Weighted by Path Length (SPL) combines the success indicator with a function of the path length from the starting point to the target, encouraging the agent to find the target object in as few steps as possible within the allowed budget. SPL is defined as \(\frac{1}{N}\sum _{i=1}^{N}S_{i}\frac{L_{i}}{\max (P_{i},L_{i})}\), where N is the number of rounds and \(S_i\) is the indicator of success in round i, with 0/1 representing failure/success in finding the target object. \(P_i\) is the length of the path taken, and \(L_i\) is the number of steps of the shortest path in round i.
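For concreteness, the following short sketch computes SR and SPL exactly as defined above; the episode records in the example are illustrative.

```python
# Minimal sketch of the SR and SPL metrics defined above.
def success_rate(successes):
    """SR = fraction of rounds in which the agent reached the target."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """SPL = (1/N) * sum_i S_i * L_i / max(P_i, L_i)."""
    total = 0.0
    for s_i, l_i, p_i in zip(successes, shortest_lengths, path_lengths):
        total += s_i * l_i / max(p_i, l_i)
    return total / len(successes)

# Example: three rounds
S = [1, 0, 1]          # success indicators S_i
L = [10, 8, 12]        # shortest path lengths L_i
P = [14, 30, 12]       # actual path lengths P_i
print(success_rate(S), spl(S, L, P))
```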

The behavior of the navigation policy varies for short and long paths. In this paper, trajectories with optimal path lengths of at least 5 steps are denoted as \(L\ge 5\).

3.4 Task planning and navigation policy

3.4.1 Global policy

A knowledge graph is a semantic network that reveals relationships between entities and is widely used in tasks such as question answering, recommendation systems, and information retrieval. After obtaining the semantic knowledge graph and identifying the zones, we proceed by selecting a search window around each zone. Within these search windows, we capture all semantic objects as zone information, depicted as circles in Fig. 2. We introduce two approaches leveraging the capabilities of large language models for zone selection. Both paradigms involve summarizing the contents of a zone into a query sentence, which is then processed in the following manner:

Meta-Learning A transferred pre-trained language model scores which target object category forms the most reasonable description in the query.

Graph Attention Network The query string is embedded using a pre-trained language model. Then, a graph attention network fine-tuned with the MMSG outputs a distribution over object categories based on this embedding.

Fig. 5

Example of the meta-learning approach

Fig. 6

Example of the graph attention network approach

  1. (1)

    Preprocessing for Language Models: Our objective is to use a sentence to encapsulate the semantic details surrounding the frontier, and then employ a language model to evaluate the quality of the description. To gauge the correlation between a zone and a target object, we employ masked language models (MLMs) to assess the coherence and grammatical sensibility of a string W, as demonstrated in prior works [5]. For instance, the score assigned to the sentence "This zone contains sink, microwave, and mug." would be higher than that of "This zone contains sink, microwave, and bed." By evaluating the scores of strings containing common sense information, we can derive a proxy measure indicating the likelihood of the stated fact being true. To address the intricacies of the environment, we employ a method proposed in [5] to calculate the entropy of each object category. This approach helps mitigate the impact of uninformative and ubiquitous objects, such as doors and windows. Information entropy is a mathematical concept describing the average uncertainty of all possible information that an information source can generate. In detail, we tally the occurrences of each object category surrounding every target category and then normalize the counts across targets to derive \(p(t_{j}|o_{i})\). With access to \(p(t_{j}|o_{i})\), we compute the entropy using the following formula:

    $$\begin{aligned} H_{O_{i}}=-\varSigma _{t_{j}\in L_{T}}p(t_{j}|o_{i})\log p(t_{j}|o_{i}) \end{aligned}$$

    where \(L_{T}\) is the list of target objects, and H is the entropy. Entropy reaches its maximum when the distribution is uniform and its minimum when the distribution is one-hot, indicating that more semantically informative objects yield lower entropy.

  2. (2)

    Meta-Learning Approach: When navigating to new scenes, agents may face challenges in generalizing due to limited initial experience. However, the spatial structure distribution of knowledge graphs in similar rooms of the same type exhibits similarity, which can be learned through machine learning. Meta-learning (ML) [13] has been proposed to address this implicit distribution across tasks. Meta-learning enables models to acquire the ability to learn parameter tuning, allowing them to rapidly adapt to new tasks based on existing knowledge. Specifically, we construct \(|L_{T}|\) query strings for each zone \(z_{i}\), one per target object class:

    $$\begin{aligned} W_{t_{j}}^{z_{i}}=\text {``A zone contains } o_{1},o_{2},\ldots ,o_{k},t_{j}.\text {''}, \quad \forall t_{j} \in L_{T} \end{aligned}$$
    (1)

    where \(o_{1},o_{2},...,o_{k}\) are the objects detected in zone \(z_{i}\) and \(t_{j}\) is the target object. All queries are input into the LLM for scoring, which reflects the coherence of each query, and the zone with the highest probability of relevance to the target query sentence is selected (a minimal scoring sketch is given after this list). The score for each zone is:

    $$\begin{aligned} S_{z_{i}}^{LLM}=\log p(W_{t_{j}}^{z_{i}}) \end{aligned}$$
    (2)
  3. (3)

    Graph Attention Network Approach: The agent utilizes GAT to assign different weights to neighboring nodes, depending on whether they belong to the same adjacency layer or span different scenes. In this way, GAT extracts features from the MMSG according to the scene, enabling more informed decisions. Specifically, we feed a single query string of the following form for each zone \(z_{i}\):

    $$\begin{aligned} W^{z_{i}}_{t_{j}}=\text {``This zone contains } o_{1},\ldots , \text {and } o_{k}.\text {''} \end{aligned}$$
    (3)

    This string is input into the LLM to produce a summary embedding vector. Finally, the embedding is fed into the fine-tuned GAT head with a multi-head attention mechanism, which produces a \(|L_{T}|\)-dimensional vector of prediction logits over the target categories. The inferred zone is the one with the maximum output value for the target object. The final score of each zone is then:

    $$\begin{aligned} S_{z_{i}}^{LLM}=[f_{\theta }(Att(W^{z_{i}}))]_{t_{j}} \end{aligned}$$
    (4)

    where \(f_{\theta }: Att(W^{z_{i}}) \longrightarrow {\mathbb {R}}^{|L_{T}|}\) is the attention embedding that maps the query embedding to logits. We use an 8-head attention mechanism for this mapping.
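To make the zone-scoring steps above concrete, the sketch below (with assumed helper names; llm_log_prob stands in for any masked or causal language-model scorer) implements the object-entropy computation and the meta-learning query construction and scoring; the GAT head is omitted.

```python
# Minimal sketch of entropy weighting, query construction, and LLM-based zone scoring.
import math
from collections import Counter

def category_entropy(cooccurrence_counts):
    """H_{o_i} = -sum_j p(t_j | o_i) log p(t_j | o_i), from target/object co-occurrence counts."""
    entropies = {}
    for obj, target_counts in cooccurrence_counts.items():
        total = sum(target_counts.values())
        probs = [c / total for c in target_counts.values() if c > 0]
        entropies[obj] = -sum(p * math.log(p) for p in probs)
    return entropies

def build_queries(zone_objects, targets):
    """One query string per target class t_j for a zone z_i (meta-learning paradigm)."""
    return {t: f"A zone contains {', '.join(zone_objects)}, {t}." for t in targets}

def score_zone(zone_objects, target, llm_log_prob):
    """S_{z_i}^{LLM} = log p(W_{t_j}^{z_i}) for the queried target."""
    query = build_queries(zone_objects, [target])[target]
    return llm_log_prob(query)

# Example: "door" co-occurs uniformly with targets, so it gets higher (less informative) entropy.
counts = {"sink": Counter({"mug": 8, "bed": 1}), "door": Counter({"mug": 5, "bed": 5})}
print(category_entropy(counts))
print(build_queries(["sink", "microwave"], ["mug", "bed"]))
```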

3.4.2 Local policy

To navigate from the current location of the agent to the target, we utilize the Fast Marching Method (FMM) [36]. Subsequently, the agent selects the nearest zone within a restricted range of its current position and executes the final action \(\in A\) to reach it. At each step, the local graph and local target are updated based on new observations. This approach, which employs modular policies, enhances training efficiency and avoids the need to learn obstacle avoidance from scratch, as required in end-to-end methods.
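As a rough illustration of this local policy, the sketch below uses scikit-fmm (assumed available) to compute a Fast Marching distance field to the selected local target on an assumed 2D occupancy grid and then greedily steps toward decreasing distance; the grid representation and the action mapping are simplifications, not the exact planner used here.

```python
# Minimal local-policy sketch with an FMM distance field; grid and helpers are assumptions.
import numpy as np
import skfmm

def fmm_distance_field(occupancy, goal_cell):
    """occupancy: 2D bool array (True = obstacle); goal_cell: (row, col) of the local target."""
    phi = np.ones(occupancy.shape)
    phi[goal_cell] = -1.0                          # zero level set around the goal
    phi = np.ma.MaskedArray(phi, mask=occupancy)   # obstacles are excluded from the march
    return skfmm.distance(phi, dx=0.25)            # 0.25 m grid resolution (agent step size)

def greedy_step(dist, agent_cell):
    """Move toward the 4-neighbor with the smallest geodesic distance to the goal."""
    d = np.ma.filled(dist, np.inf)
    r, c = agent_cell
    neighbors = {"north": (r - 1, c), "south": (r + 1, c),
                 "west": (r, c - 1), "east": (r, c + 1)}
    return min(neighbors, key=lambda k: d[neighbors[k]])
```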

4 Experiments

4.1 Experimental setup

We evaluate our MMSG navigation framework in the ProcTHOR virtual indoor navigation environment. This photo-realistic environment contains four types of rooms: kitchen, living room, bathroom, and bedroom. For a fair comparison, we follow the main baseline SAVN [43]. Each room type uses 20 scenes for training, 5 scenes for validation, and 5 for testing. Besides, we also train our framework on the Gibson and HM3D real-world environments. The HM3D dataset is in Habitat format; we split it into 75 training scenes and 20 validation scenes. The Gibson dataset comprises 25 training and 5 validation scenes. The maximum number of navigation steps per episode is 100, meaning the agent must reach the target within 100 steps for the episode to count as successful. We train all methods until convergence, up to 20 million frames. Our model is implemented in the PyTorch framework; we use RMSprop as the optimizer during adaptation and SharedRMSprop [23] otherwise.

4.2 Implementation details

To obtain the visual inputs for MMSG, we use a pre-trained ResNet50 to extract observation features from \(300\times 300\) images at each time step. The model uses GloVe [31] to generate a 300-dimensional semantic embedding of the target and graph objects, 92 objects in total. The input of our actor-critic network concatenates the target object as a 300-dimensional vector, the observation features as a 1024-dimensional feature vector, and the knowledge graph as a \(92\times 92\) matrix. GAT is also used for knowledge inference, producing a single value that is appended to our critic. The previous-action sample size is \(1\times 6\) and the trajectory memory size is \(1\times 1024\). We concatenate these five features (observation image features, target word embedding, scene layout knowledge graph, previous actions, and trajectory memory) and feed them into the state encoder. Our actor-critic network consists of an LSTM with 512 hidden states and two fully connected layers representing the actor and the critic. The actor outputs a 6-dimensional action distribution \(\pi (a_{t}|x_{t})\) via softmax, and the critic estimates a single value. The decomposed value from GAT is fed into the critic embedding, contributing to the value estimation. Notably, our MMSG agent updates the knowledge graph in unseen scenes and corrects wrong priors in the policy network.
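The sketch below mirrors the state encoder and actor-critic head described above in PyTorch; the stated dimensions follow the text, while the specific layer wiring is an assumption.

```python
# Minimal sketch of the state encoder and actor-critic head (dimensions follow the text).
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=1024, word_dim=300, graph_dim=92 * 92,
                 action_dim=6, memory_dim=1024, hidden=512):
        super().__init__()
        state_dim = obs_dim + word_dim + graph_dim + action_dim + memory_dim
        self.encoder = nn.Linear(state_dim, hidden)       # state encoder
        self.lstm = nn.LSTMCell(hidden, hidden)           # 512-hidden-state LSTM
        self.actor = nn.Linear(hidden, action_dim)        # 6-way action head
        self.critic = nn.Linear(hidden, 1)                # scalar value head

    def forward(self, obs_feat, word_emb, graph_feat, prev_action, memory, hx_cx):
        x = torch.cat([obs_feat, word_emb, graph_feat.flatten(1),
                       prev_action, memory], dim=1)
        h, c = self.lstm(torch.relu(self.encoder(x)), hx_cx)
        policy = torch.softmax(self.actor(h), dim=-1)     # pi(a_t | x_t)
        value = self.critic(h)                            # critic value estimate
        return policy, value, (h, c)

# Example forward pass with batch size 1
m = ActorCritic()
out = m(torch.zeros(1, 1024), torch.zeros(1, 300), torch.zeros(1, 92, 92),
        torch.zeros(1, 6), torch.zeros(1, 1024),
        (torch.zeros(1, 512), torch.zeros(1, 512)))
```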

For the experimental setting, the maximum number of steps per episode is 100, which means the agent must reach the target within 100 steps for the episode to be successful. Scenes (\(1-20\)) in the ProcTHOR environment are used for training, scenes (\(21-25\)) as the test set, and the last scenes (\(26-30\)) as the validation set.

Table 1 Total comparison of success rate SR (\(\%\)) and SPL (\(\%\)) on navigation performance, where bold denotes performance exceeding the other models

4.3 Baseline and SOTA comparison

To demonstrate our main contributions and the necessity of each component, we compare against the following baselines.

  1. (1)

    Random Walk The agent randomly selects actions in the scene.

  2. (2)

    Zone-based Policy [49] This baseline method employs a classical robotics pipeline for mapping and a frontier-based exploration algorithm.

  3. (3)

    SemExp [3] We follow [3] as the baseline to explore and search for the target using semantic map.

  4. (4)

    PONI [34] The potential function approach [3] is the newest map-based work and serves as a baseline for interaction-free learning methods. We could only obtain its results on the Gibson dataset from the published work.

  5. (5)

    VGM [20] It presents a novel visual graph memory structure for visual navigation, which uses a GCN and an attention mechanism to gradually accumulate graph training experience.

  6. (6)

    CLIP on Wheels (CoW) [16] It proposes a gradient-based visualization technique on CLIP to localize the target in the egocentric view, together with frontier exploration for zero-shot navigation.

  7. (7)

    L3MVN [48] It leverages Large Language Models (LLM) to impart common sense for object searching.

  8. (8)

    SemUtil [4] It builds a structured scene representation based on SLAM and then injects semantics into geometry-based frontier exploration to reason about promising areas to search for the target.

  9. (9)

    LOAT [24] They propose LLM-enhanced Object Affinities Transfer (LOAT) to leverage experiential object affinities for adapting to new environments.

  10. (10)

    ESC [50] They present a zero-shot object navigation method with soft commonsense constraints (ESC) for open-world object navigation without extra experience or training.

  11. (11)

    SayPlan [35] They introduce 3D scene graph (3DSG) representations that allow LLMs to conduct semantic search, reduce the planning horizon, and refine the initial plan.

  12. (12)

    VLFM [47] They build a vision-language frontier map from depth observations for zero-shot navigation.

  13. (13)

    SayNav [33] They build a 3D scene graph as input for refining the LLM-generated plan.

4.4 Benchmark results

We compared our MMSG model with other graph-based and LLM-based methods, replacing our model and MMSG with the other models and running experiments in the same environment. Methods without published results on a dataset are denoted by "-". Experiments were conducted on four datasets, including the real-environment datasets Gibson, HM3D, and MP3D, and the virtual-environment dataset ProcTHOR. Each model required an average training time of approximately 3 days. The overall experimental results are summarized in Table 1. MMSG outperforms most graph-based and LLM-based methods in these benchmarks, with improvements of \(+36.5\%\) SPL and \(+42.4\%\) success rate in ProcTHOR compared to VGM, \(+16.3\%\) SPL and \(+6.9\%\) success rate in HM3D compared to SemExp, and \(+20.28\%\) SPL and \(+47.37\%\) success rate in HM3D compared to ESC.

The superior performance of MMSG on the Gibson dataset can be attributed to the absence of scenarios where the robot needs to climb stairs to reach the target, a scenario present in both the HM3D and MP3D datasets. The only methods that outperform MMSG in success rate are VLFM and ESC, which are based on occupancy maps and can traverse stairs to search for targets on different levels. However, because the odometry provided by the agent lacks a z-coordinate, MMSG currently only supports single-floor scenes, and resetting the settings and graph structures when changing floors is complex. Consequently, MMSG cannot handle multi-floor scenarios.

Fig. 7

Comparison of different models on the success rate for multiple target objects

In Fig. 7, although our model performs slightly below the VLFM method in single-object search tasks, it demonstrates good stability in multi-object tasks. The success rate of all models decreases as the number of consecutive targets increases, because if the agent fails to find an intermediate target, subsequent targets may be farther away, leading to a loss of graph reasoning or training memory guidance. However, as the number of targets increases, the performance gap between our model and SemExp shrinks, indicating that the graph updates its structural information over time to adapt to changes in the number of targets. Additionally, as time progresses, graph information becomes increasingly important for the agent to explore new environments. Exploratory SLAM mapping methods are effective in the initial stages when facing unknown environments, but for semantically similar regions, these methods are prone to cumulative errors and incorrect judgments.

Table 2 Impact of different MMSG components in ProcTHOR. The larger the proportion of graph nodes, the better the performance. \(L\ge 5\) denotes episodes whose optimal path length is at least 5 steps
Table 3 SR (\(\%\)), SPL (\(\%\)), and SoftSPL (SSPL, \(\%\)) comparison between different methods on three datasets, where bold denotes a metric significantly superior to the other models

Additionally, the higher performance on the Gibson and HM3D datasets compared to the MP3D dataset can be attributed to the quality of the 3D scans. The visual fidelity of the MP3D dataset is significantly lower than that of the HM3D dataset, while scenes from the Gibson dataset have been manually repaired and verified to be free of holes and artifacts.

4.5 Ablation study

To demonstrate the effectiveness of scene understanding and common sense graph inference, we also design GLIP on Wheels (GoW) based on [50]. Instead of using graph reasoning as in MMSG, GoW consistently selects the nearest frontier during navigation, with a maximum distance of 1.8 m. It is worth noting that GoW shares the same navigation policy as MMSG, except for the frontier selection strategy.

Impact of scene understanding and graph inference. In Table 3, GLIP, based on open-world object detection, outperforms CoW on all metrics in ProcTHOR, indicating that GLIP is more suitable for multi-modal understanding. Additionally, MMSG outperforms GoW on all datasets and metrics, demonstrating that coarse-to-fine graph-based reasoning generalizes better than heuristic exploration.

Table 4 Comparison of different methods on the ProcTHOR dataset


Impact of different LLMs. To compare the inference performance of different LLMs on HM3D, we use GPT-3.5, LLaMA-7B, and LLaVA for inference. All three LLMs significantly improve performance over GoW. GPT performs similarly to LLaVA when using the room prompt without specific common sense training.

Impact of MMSG. As shown in Table 2, our MMSG adapts to changes in scene layouts and facilitates knowledge reasoning. The significantly improved SPL indicates that MMSG enhances navigation efficiency. Furthermore, to investigate the impact of MMSG on generalization, we train models with \(25\), \(50\), and \(75\%\) of MMSG nodes, i.e., randomly sampled subsets of MMSG nodes along with their corresponding objects. Results show that a higher proportion of MMSG nodes leads to better success rates and SPL.

Fig. 8

Visualization of the testing process. With the help of MMSG, the agent can successfully infer the invisible target object

After the training process, when utilizing Graph Attention Network (GAT) embedding weight matrices for navigation, our results demonstrate that the proposed Multi-Modal Scene Graph (MMSG) method exhibits superior robustness and faster extraction of key features. This can be attributed to the precise construction of our MMSG, in contrast to other approaches [10, 49] that rely solely on object detectors. Such detectors are prone to being affected by differences in appearance when transitioning from real to simulated environments. The effectiveness of our approach is further supported by the visualizations in Fig. 8, which highlight how the model integrates prior knowledge from the MMSG. As depicted in Fig. 8, during the search for "Dining Table," our agent concentrates more on the relatively larger area, leading it to make a precise localization.

4.6 Results and discussion

The quantitative results of the comparative study are presented in Table 4. As indicated by the findings, random walking without any specialized navigation policy results in failure in nearly all episodes. However, when leveraging the map-based framework to randomly sample the long-term goal, the performance surpasses even that of the classical frontier-based method [18]. This underscores the significant advantage of the map-based approach in enabling the robot to swiftly and roughly explore the environment. Additionally, the notable improvement achieved by SemExp [2] underscores the importance of semantic information in efficient exploration. PONI [34] further enhances performance while reducing computational costs, showcasing its ability to learn semantic priors in a distinct manner from other reinforcement learning approaches. Our framework consistently outperforms all baselines across both datasets, demonstrating notable enhancements over the SemExp [2] and PONI [34] baselines. The comparison between the feed-forward and zero-shot approaches suggests that the feed-forward method learns more precise relevance in large indoor scenes.

5 Conclusion

We presented MMSG, a novel multi-modal knowledge graph-based LLM visual navigation agent that applies large language models to facilitate visual navigation by examining two paradigms that infer semantic relevance from observed zones. Through experiments on the ProcTHOR dataset, we demonstrate that the MMSG agent can utilize LLMs for efficient visual navigation using only cameras. Our findings suggest that large language models hold immense potential in aiding robots in such tasks by providing useful knowledge. Future research should consider the design of the interaction between the robot and the LLM. This purely visual solution significantly reduces the production costs of traditional home service robots. It also contributes to urban sustainable development by conserving resources and protecting the environment, while providing humans with hygienic, safe, and barrier-free green smart home services.