Fig. 1

Motivation of the visual navigation framework. The robot explores the scene and applies large language models to find the most relevant graph node (shown as a circle) based on the observation and the target object

1 Introduction

With the rapid development of computer vision technology, the expansion of the industry, and the continuous optimization and integration of computing resources for large-scale deep learning frameworks, applications such as intelligent home service robots and indoor visual navigation have emerged [12, 14, 15, 18, 19, 21, 26, 29, 30, 44]. These agents accomplish their respective functions within their environments through autonomous navigation, which comprises localization, map construction, path planning, and motion control. Visual navigation methods encompass all of these processes.

An agent, functioning as an entity that perceives and acts within its environment, makes decisions on behalf of users or programs with the aid of a planner. Possessing a degree of autonomy, agents execute specific, predictable, and repetitive tasks for users or applications. They encompass virtual entities such as game characters [39] and online services [6], as well as real-world counterparts like warehouse robots [32], underwater robots [42], and servo robots [17]. Navigation is a critical technology in the interaction between agents and their surroundings. Furthermore, large language models are currently a prominent focus of research because they can equip robots with prior knowledge, supervise the training of robot systems, and reduce training costs. This is illustrated in Fig. 1: when searching for the target object "bed", the agent queries the LLM to find the most relevant graph node (shown as a circle) based on the observation and the target object.

In recent years, the emergence of large language models (LLMs) has provided new possibilities for embedding rich semantic knowledge into robots [24, 50]. These models can process vast amounts of natural language text and extract abundant semantic information, which can be used to guide robot behavior. By integrating with LLMs, robots can better understand human instructions and intentions and make better-informed decisions when performing tasks.

Zhu [52] proposed the first visual navigation framework, laying the foundation for basic visual navigation models. It stores target location relationships as spatial structural information of the environment, providing spatial and visual relations for known targets in unknown scenes and thereby improving generalization across scenes. The framework first extracts environmental observations using pre-trained siamese (twin) networks and trains each scene separately, using these observations as state inputs to form a mapping from the environment to actions. It set the precedent for indoor mapless visual navigation.

Qiu [9] proposed a knowledge graph constructed from "parent target–sub-target" relations, where the spatial distance between sub-targets and the target is used as the reward value, encouraging the agent to explore different parent targets in the room until the target is found. Obin [20] proposed a Visual Graph Memory (VGM) composed of unsupervised image representations. Using VGM, the agent can embed its navigation history and other useful task-related information, thus improving navigation generalization across scenes (Fig. 2).

Learning-based visual navigation methods often require substantial computational resources to acquire and apply scene priors. For example, reinforcement learning may require millions of steps to converge, leading to high training costs. This prompts us to search for alternative ways of acquiring scene priors that do not rely on reinforcement learning. Recent research has explored interaction-free learning and imitation learning to address this issue, yielding significant reductions in training cost.

To address the challenges of effective environment representation and path planning, we devise a path planning policy supervised by prior knowledge. Leveraging large language models for multi-modal information processing, we establish a knowledge graph-based clustering control policy that assists the agent in localization and path selection. By continuously measuring the semantic distance between the current cluster and the target, we update the positional information of the global graph and the agent's own state in real time during navigation. This model reduces training errors and can adapt to changes across environments. The policy groups semantically close objects into a single node (cluster) of the knowledge graph (e.g., remote control, TV, and sofa form one node). During updates, Graph Attention Networks (GAT) are employed to obtain useful structural information from the environment. The graph attention layer learns the MMSG by recursively propagating graph embedding features from neighboring nodes to refine node embeddings, and applies attention to weigh the importance of target neighbors in the observation. It emphasizes target-relevant features while suppressing irrelevant features in the MMSG, making the extracted local features more discriminative and facilitating target localization. The proposed method demonstrates good generalization of the MMSG in ProcTHOR. In navigation performance tests, the scene-adaptive MMSG helps navigation agents find targets faster, with higher data efficiency and success rates. Experiments show that the MMSG modules achieve state-of-the-art results in the ProcTHOR environment. Our key contributions are summarized as follows:

  • We design a large-scale multi-modal knowledge graph dataset of scenes to train the task planner from pre-trained LLMs, constructing a multi-modal dataset containing 120 indoor scenes, 35,000 instructions, and corresponding action plans.

  • We introduce two paradigms, (i) a Meta-Learning approach and (ii) a Graph Attention Network approach, into our visual navigation framework, enabling path planning without extra expensive training.

  • In our benchmark evaluation, we assessed various LLMs and LMMs for complex embodied task planning. Additionally, we conducted an ablation study to determine the optimal representation of visual scenes for generating executable actions.

2 Related work

2.1 Visual navigation

In visual navigation tasks, an agent performs actions based on visual observations to reach a given target object [11, 28, 41, 45]. Initially, the visual navigation model TD-A3C [52] proposed a method that uses ResNet to extract visual features from the environment images and map them to navigation actions. Subsequently, Yang et al. [46] guided the agent’s exploration in unknown environments by pre-building a scene knowledge graph. Zhang et al. [49] studied the parent–child relationship between large and small objects in the environment using HOZ to improve the agent’s ability in ambiguous search. Dang et al. [8] proposed an unbiased directional scene graph to eliminate the phenomenon of modality collapse between different modalities. However, these models struggle to dynamically update prior knowledge or to use prior knowledge efficiently. Although they use pre-established maps to improve the robot’s localization ability, they require constructing a corresponding scene graph for each scene, which necessitates large-scale training to achieve scene generalization. In our model, target images and contextual semantics are fused, and large-scale prior knowledge is used to dynamically update the scene graph, giving the MMSG cross-scene generalization ability.

2.2 Large language models for visual navigation

In recent years, several works have focused on utilizing existing large-scale models to assist visual navigation. LM-Nav [37] utilizes an open-vocabulary detector to extract landmarks from the navigation environment. These landmarks are then passed to a vision-language model for localization and planning. L3MVN [48] proposes a method that uses a semantic segmentation model to calculate the entropy of objects in each frontier. This entropy is represented as query strings, and a language model is used to determine a more relevant frontier. ZSON [27] identifies objects from open-vocabulary categories and encodes them in a semantic embedding space. Zhou et al. [50] design an exploration method with soft commonsense constraints (ESC) to restrict the behavior of the agent. These methods leverage common sense knowledge from pre-trained models, eliminating the need for additional priors or training. They formulate path planning to execute actions within predefined intervals but may result in suboptimal positions due to limited data. Additionally, they lack the ability to adjust plans during the execution of navigation behavior. Therefore, agents need to interact and incrementally generate and update foundational plans in unknown environments. These approaches heavily rely on prompt engineering for LLMs and do not fine-tune the LLMs. In contrast, our method does not require extensive prompt engineering and directly fine-tunes LLMs for the visual navigation policy.

2.3 Multi-modal large language models

Utilizing language models for detecting target categories in visual navigation requires the agent to understand modalities beyond text. In this context, we discuss approaches that fine-tune language models with image-text pairs to enhance their visual capabilities. DeepSeek-VL [25] proposes a vision-language scene understanding model designed for real-world vision and language understanding applications. MiniGPT4 [51] fine-tunes the pre-trained LLaMA [40] model using curated image-text pairs. It utilizes the visual encoder and Q-Former from BLIP2 [22], adds a trainable linear layer to transform visual features into visual tokens, inserts the visual tokens and the text tokens from the text prompt into LLaMA [40], and trains the model. InstructBlip [7] extends the idea of MiniGPT4 by training language models with high-quality image-text pairs. InstructBlip collects 26 publicly available datasets covering various tasks and capabilities and converts them into an instruction tuning format for fine-tuning language models. However, these approaches rely heavily on text information from LLM priors and ignore image features that could provide object relationships and coordinates, which may hinder the agent’s path planning ability. Similar to MiniGPT4 and InstructBlip, our method involves creating correspondences between agent observations and actions, which we use to fine-tune language models. We consider observations from visual images and text. Text is used to represent actions, such as "turn left" or "move forward."

3 Proposed visual navigation method

In this section, we first introduce the process of constructing the multi-modal knowledge graph dataset. Following that, we describe the process of utilizing large language models to select nodes within this graph. Finally, we provide detailed explanations of associating specific task plans with visual scenes through image collection and object detection, as well as selecting navigation paths.

3.1 Multi-modal knowledge graph generation

In the visual navigation task, the agent cannot access prior knowledge about the environment (such as topological maps and 3D grid maps) or use additional sensors (such as depth cameras). The agent’s only source of data is an RGB image from an egocentric perspective, and it predicts its next action based on the current view and previous states. Given a target object class, such as "toaster," the visual navigation task requires the agent to navigate until it observes an instance of this class. Given an embodied 3D scene \(X_s\), we adopt the class names of all objects as the scene representation, denoted as \(X_l\), where \(X_l = [table, chair, keyboard,...]\). A common approach to generating multi-modal instructions for embodied task plans, as seen in the ALFRED benchmark [38], is to manually craft a series of instructions with corresponding step-by-step actions. However, this manual approach incurs significant annotation costs when generating complex task plans suitable for practical service robots, such as tidying up bathrooms and making sandwiches.

To efficiently generate large-scale, complex instructions \(X_q\) and corresponding executable plans \(X_a\) for a given 3D scene, we design a prompt to simulate scenarios of embodied task planning for GPT-3.5 to automatically synthesize data based on the object name list \(X_l\). As detailed in Table 5 of the supplementary materials, our prompt outlines the definition of embodied task planning, requirements, and several examples of generated instructions and corresponding action plans. Specifically, the prompt simulates a conversation between the service robot and humans to generate executable instructions and actions, mimicking robots’ exploration in embodied environments and addressing humans’ needs.
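As a hedged illustration (not the authors' exact pipeline), the sketch below shows how instruction/plan pairs could be synthesized from the object list \(X_l\) with the OpenAI chat API; the prompt wording, the helper name synthesize_task_plans, and the sampling settings are assumptions.

```python
# Minimal sketch of prompting GPT-3.5 to synthesize instruction/plan pairs
# from a scene object list X_l. Prompt text and helper names are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthesize_task_plans(object_list, num_samples=5):
    """Ask the LLM to generate instructions X_q and executable plans X_a for X_l."""
    system_prompt = (
        "You are a household service robot. Given a list of objects present in a "
        "3D indoor scene, generate user instructions and step-by-step executable "
        "action plans that only reference objects from the list."
    )
    user_prompt = (
        f"Objects in the scene: {', '.join(object_list)}.\n"
        f"Generate {num_samples} instruction/plan pairs."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Example: X_l for a kitchen scene
print(synthesize_task_plans(["table", "chair", "toaster", "mug", "sink"]))
```

Only pairs whose actions reference objects in \(X_l\) would then be kept, mirroring the filtering described below.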

Fig. 2

Overview of our MMSG agent using LLMs. First, the agent obtains the target Region of Interest (ROI), positional encoding (PE), bounding box (bbox), and target category via multi-view object detection, along with the cosine similarity (CS) with the target object. Subsequently, it fuses node information with the target word vector using CLIP fusion, which is input to the GAT to obtain the environmental feature representation. This feature is compared with the features in the Alpaca model, and based on the user’s query and the prompt template, a task plan is generated to complete the task. Throughout this process, MMSG serves as grounding supervision to reduce hallucination errors

The generated instructions encompass a variety of requests, commands, and queries, with only instructions containing explicitly executable actions added to our dataset. Additionally, we stress that the target object of the generated action should be confined within the object list \(X_l\) to mitigate object hallucination leading to inexecutable plans. For the object list used in the prompt for dataset generation, we directly employ the ground truth label of existing instances in the scene.

Fig. 3

Example of the generated multi-modal knowledge graph-based visual scenes, instructions and plans

In Fig. 3, we provide examples of the generated samples containing the object name list of the scene, instructions, and executable action steps. In embodied task planning, the agent can only access the visual scene containing all interactive objects without the ground truth object list. Therefore, we construct the multi-modal dataset by defining triplets for each sample as \(X = (X_v, X_q, X_a)\). During the training phase of the task planner, we leverage the ground truth object list for each scene to mitigate the influence of inaccurate visual perception. During the inference phase, the extended DETR [11] object detector predicts the list of all existing objects in the scene. At each time step t, the agent obtains the observation \(o_{t}\) from its monocular camera, as well as the target object \(c\in {C}\). Given the observation \(o_{t}\) and the target object c, the agent utilizes the visual navigation network to generate the policy \(\pi (a_{t}|o_{t},c)\), where \(a_{t}\) represents the action distribution at time t, and the agent selects the action with the highest probability for navigation (Figs. 4, 5, 6).

During the navigation process, the agent moves at time t and obtains the observation \(o_{t}\) of the current state \(s_{t}\). By executing an action \(a_{t}\sim \pi (o_{t})\) given observation \(o_{t}\), the agent interacts with the environment, obtains the reward \(r_{t}\) for that action, and transitions to the new state \(s_{t+1}=\nu (s_{t}, a_{t})\). The goal of the agent in each round is to maximize the cumulative reward \(R=\varSigma _{t=0}^{T}\gamma ^{t}r_{t}\). A is the action space, consisting of a set of discrete actions. In the navigation environment, the agent moves using six different actions, \(A=\{MoveAhead, RotateLeft, RotateRight, LookUp, LookDown, Done\}\). Specifically, the MoveAhead step is 0.25 m, the rotation angle for RotateLeft and RotateRight is \(45^{\circ }\), and the tilt angle for LookUp and LookDown is \(30^{\circ }\). A round is defined as successful if it meets the following three criteria simultaneously: (1) the agent issues the end action Done within the allowed number of steps; (2) the target object is within the agent’s field of view; (3) the distance between the agent and the target is less than 1.5 m. Otherwise, the round is considered a failure.
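To make this setup concrete, the following minimal sketch (with assumed env and policy interfaces) runs one episode, applies the three success criteria, and computes the discounted return \(R=\varSigma _{t=0}^{T}\gamma ^{t}r_{t}\); the discount value and the info fields are assumptions.

```python
# Minimal sketch of the episode bookkeeping described above (0.25 m step,
# 45°/30° rotations, <= 100 steps, 1.5 m success radius). `env` and `policy`
# are illustrative placeholders, not the authors' implementation.
import numpy as np

ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight", "LookUp", "LookDown", "Done"]
GAMMA = 0.99  # discount factor (assumed value)

def run_episode(env, policy, target, max_steps=100, success_dist=1.5):
    rewards, done_called, target_visible, dist = [], False, False, np.inf
    obs = env.reset()
    for t in range(max_steps):
        probs = policy(obs, target)              # pi(a_t | o_t, c)
        action = ACTIONS[int(np.argmax(probs))]  # pick the most probable action
        obs, reward, info = env.step(action)     # s_{t+1} = nu(s_t, a_t)
        rewards.append(reward)
        if action == "Done":
            done_called = True
            target_visible = info["target_in_view"]
            dist = info["distance_to_target"]
            break
    # The round succeeds only if all three criteria hold simultaneously.
    success = done_called and target_visible and dist < success_dist
    cumulative = sum(GAMMA ** t * r for t, r in enumerate(rewards))  # R = sum gamma^t r_t
    return success, cumulative
```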

We employ DETR as our detector to obtain local image position embeddings, regions of interest (ROIs), category labels, and the cosine similarity with the target object. These features are concatenated into a node representing the detected objects. Subsequently, a triplet (object, relationship, object) is formed based on co-occurrence relationships. Room-level region triplets: similar scenes (e.g., bedrooms) often contain common objects and layouts. For instance, when mentioning a living room, one might envision an area consisting of a sofa, pillows, and a table, or an area with a TV and a TV stand. When searching for targets, humans tend to first locate the typical areas where the targets are most likely to appear. In a specific room, the agent first explores the room randomly, observing a set of visual tuple features \((f, l)\), where \(f\in R^{(N\times {1})}\) is the target bag-of-words vector obtained by DETR, indicating the targets present in the current view, with each entry being 0 or 1 for the corresponding category.

Fig. 4

Framework of DETR object detection in our pipeline. It depicts the PE, ROI, bbox, and class information

If the current view contains several targets of the same category, only one record is made in the room-region triplet. Here, N is the number of target object categories, and \(l=\{x,z,\theta _{yaw},\theta _{pitch}\}\) denotes the observation position, where x and z are horizontal coordinates and \(\theta _{yaw}\) and \(\theta _{pitch}\) are the agent’s yaw and pitch angles. K-means clustering is then applied to the features f to obtain K regions, forming a room-level MMSG graph \(\varOmega _i (V_i,E_i)\). Scene-level zone triplets are built to infer inherent zones. To obtain a scene-oriented MMSG graph, all room-oriented MMSG graphs are grouped by scene category. Taking one scene as an example, the room-level region set is \(\varOmega =\{\varOmega _1 (V_1,E_1 ),...,\varOmega _n (V_n,E_n )\}\). Since the number of regions K is fixed, each room’s MMSG graph has the same structure for matching and merging. Using these triplets, we construct a multi-modal knowledge graph. Graph Attention Networks (GAT) are then employed to extract valuable graph features and adapt to changes in the scene layout.
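As an illustration of the room-level clustering step, the sketch below (with assumed helper names, using scikit-learn's K-means) clusters per-view bag-of-words vectors f into K regions that become MMSG zone nodes.

```python
# Minimal sketch of forming room-level MMSG zones: each exploration view yields
# a binary bag-of-words vector f over N object categories plus a pose
# l = (x, z, yaw, pitch); K-means over f gives K regions.
import numpy as np
from sklearn.cluster import KMeans

def build_room_zones(detections, poses, num_categories, k=4):
    """detections: list of per-view category-index lists; poses: list of (x, z, yaw, pitch)."""
    # Binary bag-of-words feature f in {0, 1}^N per view (one record per category).
    F = np.zeros((len(detections), num_categories), dtype=np.float32)
    for i, cats in enumerate(detections):
        F[i, list(set(cats))] = 1.0
    # Cluster views into K regions; each cluster becomes one MMSG node (zone).
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(F)
    zones = []
    for z in range(k):
        idx = np.where(labels == z)[0]
        zones.append({
            "categories": np.where(F[idx].max(axis=0) > 0)[0].tolist(),  # objects seen in zone
            "poses": [poses[i] for i in idx],                            # observation positions l
        })
    return zones
```

Because K is fixed, the resulting room-level graphs share the same structure and can be matched and merged into the scene-level MMSG as described above.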

3.2 Grounding text prompt

We generate a set of text prompts by paraphrasing them using ChatGPT. Here is an example of a text prompt: “Picture yourself as a robot, navigating to locate \(\langle\) Goal \(\rangle\) \(\langle\) GoalHere \(\rangle\) \(\langle\) / Goal \(\rangle\) . Given the current observation \(\langle\) Img \(\rangle\) \(\langle\) ImageHere \(\rangle\) \(\langle\) / Img \(\rangle\) , and suggested action probabilities \(\langle\) ActionProb \(\rangle\) \(\langle\) ActionProbHere \(\rangle\) \(\langle\) / ActionProb \(\rangle\) , please plan out your following action.” In this text prompt, \(\langle\) GoalHere \(\rangle\) represents the object category and \(\langle\) ImageHere \(\rangle\) represents the observation tokens. The action probabilities are provided as text, for example: “Stop with probability 0.03, move forward with probability 0.44, turn left with probability 0.28, turn right with probability 0.21, look up with probability 0.03, and look down with probability 0.01.”
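A minimal sketch of assembling this grounding prompt is shown below; the template wording follows the example above, while the helper names are assumptions.

```python
# Minimal sketch of building the grounding text prompt; tag names follow the example.
ACTION_NAMES = ["Stop", "move forward", "turn left", "turn right", "look up", "look down"]

PROMPT_TEMPLATE = (
    "Picture yourself as a robot, navigating to locate <Goal> {goal} </Goal>. "
    "Given the current observation <Img> {image_tokens} </Img>, and suggested action "
    "probabilities <ActionProb> {action_text} </ActionProb>, please plan out your following action."
)

def format_action_probs(probs):
    """Turn an action distribution into the natural-language form used in the prompt."""
    parts = [f"{name} with probability {p:.2f}" for name, p in zip(ACTION_NAMES, probs)]
    return ", ".join(parts[:-1]) + ", and " + parts[-1]

def build_prompt(goal, image_tokens, probs):
    return PROMPT_TEMPLATE.format(goal=goal,
                                  image_tokens=image_tokens,
                                  action_text=format_action_probs(probs))

# Example
print(build_prompt("bed", "<obs_tokens>", [0.03, 0.44, 0.28, 0.21, 0.03, 0.01]))
```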

3.3 Evaluation metrics

The most common evaluation metrics for navigation performance are Success Rate (SR) and Success Weighted by Path Length (SPL), proposed by Anderson et al. [1]. Success Rate (SR) is defined as the ratio of the number of times the agent successfully navigates to the target to the total number of rounds. Success Weighted by Path Length (SPL) combines the success indicator with a function of the path length from the starting point to the target, encouraging the agent to find the target object in as few steps as possible within the allowed budget. SPL is defined as \(\frac{1}{N}\sum _{i=1}^{N}S_{i}\frac{L_{i}}{\max (P_{i},L_{i})}\), where N is the number of rounds and \(S_i\) is the indicator of success in round i, with 0/1 representing failure/success in finding the target object. \(P_i\) is the length of the path taken, and \(L_i\) is the number of steps of the shortest path in round i.
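For concreteness, the following short sketch computes SR and SPL exactly as defined above; the episode records in the example are illustrative.

```python
# Minimal sketch of the SR and SPL metrics defined above.
def success_rate(successes):
    """SR = fraction of rounds in which the agent reached the target."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """SPL = (1/N) * sum_i S_i * L_i / max(P_i, L_i)."""
    total = 0.0
    for s_i, l_i, p_i in zip(successes, shortest_lengths, path_lengths):
        total += s_i * l_i / max(p_i, l_i)
    return total / len(successes)

# Example: three rounds
S = [1, 0, 1]          # success indicators S_i
L = [10, 8, 12]        # shortest path lengths L_i
P = [14, 30, 12]       # actual path lengths P_i
print(success_rate(S), spl(S, L, P))
```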

The behavior of the navigation policy varies for short and long paths. In this paper, trajectories with optimal path lengths of at least 5 steps are denoted as \(L\ge 5\).

3.4 Task planning and navigation policy

3.4.1 Global policy

A knowledge graph is a semantic network that reveals relationships between entities and is widely used in tasks such as question answering, recommendation systems, and information retrieval. After obtaining the semantic knowledge graph and identifying the zones, we proceed by selecting a search window around each zone. Within these search windows, we capture all semantic objects as zone information, depicted as circles in Fig. 2. We introduce two approaches leveraging the capabilities of large language models for zone selection. Both paradigms involve summarizing the contents of a zone into a query sentence, which is then processed in the following manner:

Meta-Learning A transferred pre-trained language model scores which target object category forms the most reasonable description in the query.

Graph Attention Network The query string is embedded using a pre-trained language model. Then, a graph attention network fine-tuned with the MMSG outputs a distribution over object categories based on this embedding.

Fig. 5

Example of the meta-learning approach

Fig. 6

Example of the graph attention network approach

  1. (1)

    Preprocessing for Language Models: Our objective is to use a sentence to encapsulate the semantic details surrounding the frontier, and then employ a language model to evaluate the quality of the description. To gauge the correlation between a zone and a target object, we employ masked language models (MLMs) to assess the coherence and grammatical sensibility of a string W, as demonstrated in prior works [5]. For instance, the score assigned to the sentence "This zone contains sink, microwave, and mug." would be higher than that of "This zone contains sink, microwave, and bed." By evaluating the scores of strings containing common sense information, we can derive a proxy measure indicating the likelihood of the stated fact being true. To address the intricacies of the environment, we employ a method proposed in [5] to calculate the entropy of each object category. This approach helps mitigate the impact of uninformative and ubiquitous objects, such as doors and windows. Information entropy is a mathematical concept describing the average uncertainty of all possible information that an information source can generate. In detail, we tally the occurrences of each object category surrounding every target category and then normalize the counts across targets to derive \(p(t_{j}|o_{i})\). With access to \(p(t_{j}|o_{i})\), we compute the entropy using the following formula:

    $$\begin{aligned} H_{O_{i}}=-\varSigma _{t_{j}\in L_{T}}p(t_{j}|o_{i})\log p(t_{j}|o_{i}) \end{aligned}$$

    where \(L_{T}\) is the list of target objects, and H is the entropy. Entropy reaches its maximum when the distribution is uniform and its minimum when the distribution is one-hot, indicating that more semantically informative objects yield lower entropy.

  2. (2)

    Meta-Learning Approach: When navigating to new scenes, agents may face challenges in generalizing due to limited initial experience. However, the spatial structure distribution of knowledge graphs in similar rooms of the same type exhibits similarity, which can be learned through machine learning. Meta-learning (ML) [13] has been proposed to address this implicit distribution across tasks. Meta-learning enables models to acquire the ability to learn parameter tuning, allowing them to rapidly adapt to new tasks based on existing knowledge. Specifically, we construct \(|L_{T}|\) query strings for each zone \(z_{i}\), one per target object class:

    $$\begin{aligned} W_{t_{j}}^{z_{i}}=\text {``A zone contains } o_{1},o_{2},\ldots ,o_{k},t_{j}.\text {''}, \quad \forall t_{j} \in L_{T} \end{aligned}$$
    (1)

    where \(o_{1},o_{2},...,o_{k}\) are the objects detected in zone \(z_{i}\) and \(t_{j}\) is the target object. All queries are input into the LLM for scoring, which reflects the coherence of each query, and the zone with the highest probability of relevance to the target query sentence is selected (a minimal scoring sketch is given after this list). The score for each zone is:

    $$\begin{aligned} S_{z_{i}}^{LLM}=\log p(W_{t_{j}}^{z_{i}}) \end{aligned}$$
    (2)
  3. (3)

    Graph Attention Network Approach: The agent utilizes GAT to assign different weights to neighboring nodes, depending on whether they belong to the same adjacency layer or span different scenes. In this way, GAT extracts features from the MMSG according to the scene, enabling more informed decisions. Specifically, we feed a single query string of the following form for each zone \(z_{i}\):

    $$\begin{aligned} W^{z_{i}}_{t_{j}}=\text {``This zone contains } o_{1},\ldots , \text {and } o_{k}.\text {''} \end{aligned}$$
    (3)

    This string is input into the LLM to produce a summary embedding vector. Finally, the embedding is fed into the fine-tuned GAT head with a multi-head attention mechanism, which produces a \(|L_{T}|\)-dimensional vector of prediction logits over the target categories. The inferred zone is the one with the maximum output value for the target object. The final score of each zone is then:

    $$\begin{aligned} S_{z_{i}}^{LLM}=[f_{\theta }(Att(W^{z_{i}}))]_{t_{j}} \end{aligned}$$
    (4)

    where \(f_{\theta }: Att(W^{z_{i}}) \longrightarrow {\mathbb {R}}^{|L_{T}|}\) is the attention embedding that maps the query embedding to logits. We use an 8-head attention mechanism for this mapping.
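To make the zone-scoring steps above concrete, the sketch below (with assumed helper names; llm_log_prob stands in for any masked or causal language-model scorer) implements the object-entropy computation and the meta-learning query construction and scoring; the GAT head is omitted.

```python
# Minimal sketch of entropy weighting, query construction, and LLM-based zone scoring.
import math
from collections import Counter

def category_entropy(cooccurrence_counts):
    """H_{o_i} = -sum_j p(t_j | o_i) log p(t_j | o_i), from target/object co-occurrence counts."""
    entropies = {}
    for obj, target_counts in cooccurrence_counts.items():
        total = sum(target_counts.values())
        probs = [c / total for c in target_counts.values() if c > 0]
        entropies[obj] = -sum(p * math.log(p) for p in probs)
    return entropies

def build_queries(zone_objects, targets):
    """One query string per target class t_j for a zone z_i (meta-learning paradigm)."""
    return {t: f"A zone contains {', '.join(zone_objects)}, {t}." for t in targets}

def score_zone(zone_objects, target, llm_log_prob):
    """S_{z_i}^{LLM} = log p(W_{t_j}^{z_i}) for the queried target."""
    query = build_queries(zone_objects, [target])[target]
    return llm_log_prob(query)

# Example: "door" co-occurs uniformly with targets, so it gets higher (less informative) entropy.
counts = {"sink": Counter({"mug": 8, "bed": 1}), "door": Counter({"mug": 5, "bed": 5})}
print(category_entropy(counts))
print(build_queries(["sink", "microwave"], ["mug", "bed"]))
```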

3.4.2 Local policy

To navigate from the current location of the agent to the target, we utilize the Fast Marching Method (FMM) [36]. Subsequently, the agent selects the nearest zone within a restricted range of its current position and executes the final action \(\in A\) to reach it. At each step, the local graph and local target are updated based on new observations. This approach, which employs modular policies, enhances training efficiency and avoids the need to learn obstacle avoidance from scratch, as required in end-to-end methods.
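As a rough illustration of this local policy, the sketch below uses scikit-fmm (assumed available) to compute a Fast Marching distance field to the selected local target on an assumed 2D occupancy grid and then greedily steps toward decreasing distance; the grid representation and the action mapping are simplifications, not the exact planner used here.

```python
# Minimal local-policy sketch with an FMM distance field; grid and helpers are assumptions.
import numpy as np
import skfmm

def fmm_distance_field(occupancy, goal_cell):
    """occupancy: 2D bool array (True = obstacle); goal_cell: (row, col) of the local target."""
    phi = np.ones(occupancy.shape)
    phi[goal_cell] = -1.0                          # zero level set around the goal
    phi = np.ma.MaskedArray(phi, mask=occupancy)   # obstacles are excluded from the march
    return skfmm.distance(phi, dx=0.25)            # 0.25 m grid resolution (agent step size)

def greedy_step(dist, agent_cell):
    """Move toward the 4-neighbor with the smallest geodesic distance to the goal."""
    d = np.ma.filled(dist, np.inf)
    r, c = agent_cell
    neighbors = {"north": (r - 1, c), "south": (r + 1, c),
                 "west": (r, c - 1), "east": (r, c + 1)}
    return min(neighbors, key=lambda k: d[neighbors[k]])
```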

4 Experiments

4.1 Experimental setup

We evaluate our MMSG navigation framework in the ProcTHOR virtual indoor navigation environment. This photo-realistic environment contains four types of rooms: kitchen, living room, bathroom, and bedroom. For a fair comparison, we follow the main baseline SAVN [43]. Each room type uses 20 scenes for training, 5 scenes for validation, and 5 for testing. Besides, we also train our framework on the Gibson and HM3D real-world environments. The HM3D dataset is in Habitat format; we split it into 75 training scenes and 20 validation scenes. The Gibson dataset comprises 25 training and 5 validation scenes. The maximum number of navigation steps per episode is 100, meaning the agent must reach the target within 100 steps for the episode to count as successful. We train all methods until convergence, up to 20 million frames. Our model is implemented in the PyTorch framework; we use RMSprop as the optimizer during adaptation and SharedRMSprop [23] otherwise.

4.2 Implementation details

To obtain the visual inputs for MMSG, we use a pre-trained ResNet50 to extract observation features from \(300\times 300\) images at each time step. The model uses GloVe [31] to generate a 300-dimensional semantic embedding of the target and graph objects, 92 objects in total. The input of our actor-critic network concatenates the target object as a 300-dimensional vector, the observation features as a 1024-dimensional feature vector, and the knowledge graph as a \(92\times 92\) matrix. GAT is also used for knowledge inference, producing a single value that is appended to our critic. The previous-action sample size is \(1\times 6\) and the trajectory memory size is \(1\times 1024\). We concatenate these five features (observation image features, target word embedding, scene layout knowledge graph, previous actions, and trajectory memory) and feed them into the state encoder. Our actor-critic network consists of an LSTM with 512 hidden states and two fully connected layers representing the actor and the critic. The actor outputs a 6-dimensional action distribution \(\pi (a_{t}|x_{t})\) via softmax, and the critic estimates a single value. The decomposed value from GAT is fed into the critic embedding, contributing to the value estimation. Notably, our MMSG agent updates the knowledge graph in unseen scenes and corrects wrong priors in the policy network.
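The sketch below mirrors the state encoder and actor-critic head described above in PyTorch; the stated dimensions follow the text, while the specific layer wiring is an assumption.

```python
# Minimal sketch of the state encoder and actor-critic head (dimensions follow the text).
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=1024, word_dim=300, graph_dim=92 * 92,
                 action_dim=6, memory_dim=1024, hidden=512):
        super().__init__()
        state_dim = obs_dim + word_dim + graph_dim + action_dim + memory_dim
        self.encoder = nn.Linear(state_dim, hidden)       # state encoder
        self.lstm = nn.LSTMCell(hidden, hidden)           # 512-hidden-state LSTM
        self.actor = nn.Linear(hidden, action_dim)        # 6-way action head
        self.critic = nn.Linear(hidden, 1)                # scalar value head

    def forward(self, obs_feat, word_emb, graph_feat, prev_action, memory, hx_cx):
        x = torch.cat([obs_feat, word_emb, graph_feat.flatten(1),
                       prev_action, memory], dim=1)
        h, c = self.lstm(torch.relu(self.encoder(x)), hx_cx)
        policy = torch.softmax(self.actor(h), dim=-1)     # pi(a_t | x_t)
        value = self.critic(h)                            # critic value estimate
        return policy, value, (h, c)

# Example forward pass with batch size 1
m = ActorCritic()
out = m(torch.zeros(1, 1024), torch.zeros(1, 300), torch.zeros(1, 92, 92),
        torch.zeros(1, 6), torch.zeros(1, 1024),
        (torch.zeros(1, 512), torch.zeros(1, 512)))
```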

For the experimental setting, the maximum number of steps per episode is 100, which means the agent must reach the target within 100 steps for the episode to be successful. Scenes (\(1-20\)) in the ProcTHOR environment are used for training, scenes (\(21-25\)) as the test set, and the last scenes (\(26-30\)) as the validation set.

Table 1 Total comparison of success rate SR (\(\%\)) and SPL (\(\%\)) on navigation performance, where bold denotes performance exceeding the other models

4.3 Baseline and SOTA comparison

To demonstrate our main contributions and the necessity of each component, we compare against the following baselines.

  1. (1)

    Random Walk The agent randomly selects actions in the scene.

  2. (2)

    Zone-based Policy [49] This baseline method employs a classical robotics pipeline for mapping and a frontier-based exploration algorithm.

  3. (3)

    SemExp [3] We follow [3] as the baseline to explore and search for the target using semantic map.

  4. (4)

    PONI [34] The potential function approach [3] is the newest map-based work and serves as a baseline for interaction-free learning methods. We could only obtain its results on the Gibson dataset from the published work.

  5. (5)

    VGM [20] It presents a novel visual graph memory structure for visual navigation, which uses a GCN and an attention mechanism to gradually accumulate graph training experience.

  6. (6)

    CLIP on Wheels (CoW) [16] It proposes a gradient-based visualization technique on CLIP to localize the target in the egocentric view, together with frontier exploration for zero-shot navigation.

  7. (7)

    L3MVN [48] It leverages Large Language Models (LLM) to impart common sense for object searching.

  8. (8)

    SemUtil [4] It builds a structured scene representation based on SLAM and then injects semantics into geometry-based frontier exploration to reason about promising areas to search for the target.

  9. (9)

    LOAT [24] They propose LLM-enhanced Object Affinities Transfer (LOAT) to leverage experiential object affinities for adapting to new environments.

  10. (10)

    ESC [50] They present a zero-shot object navigation method with soft commonsense constraints (ESC) for open-world object navigation without extra experience or training.

  11. (11)

    SayPlan [35] They introduce 3D scene graph (3DSG) representations that allow LLMs to conduct semantic search, reduce the planning horizon, and refine the initial plan.

  12. (12)

    VLFM [47] They build a vision-language frontier map from depth observations for zero-shot navigation.

  13. (13)

    SayNav [33] They build a 3D scene graph as input for refining the LLM-generated plan.

4.4 Benchmark results

We compared our MMSG model with other graph-based and LLM-based methods, replacing our model and MMSG with the other models and running experiments in the same environment. Methods without published results on a dataset are denoted by "-". Experiments were conducted on four datasets, including the real-environment datasets Gibson, HM3D, and MP3D, and the virtual-environment dataset ProcTHOR. Each model required an average training time of approximately 3 days. The overall experimental results are summarized in Table 1. MMSG outperforms most graph-based and LLM-based methods in these benchmarks, with improvements of \(+36.5\%\) SPL and \(+42.4\%\) success rate in ProcTHOR compared to VGM, \(+16.3\%\) SPL and \(+6.9\%\) success rate in HM3D compared to SemExp, and \(+20.28\%\) SPL and \(+47.37\%\) success rate in HM3D compared to ESC.

The superior performance of MMSG on the Gibson dataset can be attributed to the absence of scenarios where the robot needs to climb stairs to reach the target, a scenario present in both the HM3D and MP3D datasets. The only methods that outperform MMSG in success rate are VLFM and ESC, which are based on occupancy maps and can traverse stairs to search for targets on different levels. However, because the odometry provided by the agent lacks a z-coordinate, MMSG currently only supports single-floor scenes, and resetting the settings and graph structures when changing floors is complex. Consequently, MMSG cannot handle multi-floor scenarios.

Fig. 7

Comparison of different models on the success rate for multiple target objects

In Fig. 7, although our model performs slightly below the VLFM method in single-object search tasks, it demonstrates good stability in multi-object tasks. The success rate of all models decreases as the number of consecutive targets increases, because if the agent fails to find an intermediate target, subsequent targets may be farther away, leading to a loss of graph reasoning or training memory guidance. However, as the number of targets increases, the performance gap between our model and SemExp shrinks, indicating that the graph updates its structural information over time to adapt to changes in the number of targets. Additionally, as time progresses, graph information becomes increasingly important for the agent to explore new environments. Exploratory SLAM mapping methods are effective in the initial stages when facing unknown environments, but for semantically similar regions, these methods are prone to cumulative errors and incorrect judgments.

Table 2 Impact of different MMSG components in ProcTHOR. The larger the proportion of graph nodes, the better the performance. \(L\ge 5\) denotes episodes whose optimal path length is at least 5 steps
Table 3 SR (\(\%\)), SPL (\(\%\)), and SoftSPL (SSPL, \(\%\)) comparison between different methods on three datasets, where bold denotes a metric significantly superior to the other models

Additionally, the higher performance on the Gibson and HM3D datasets compared to the MP3D dataset can be attributed to the quality of the 3D scans. The visual fidelity of the MP3D dataset is significantly lower than that of the HM3D dataset, while scenes from the Gibson dataset have been manually repaired and verified to be free of holes and artifacts.

4.5 Ablation study

To demonstrate the effectiveness of scene understanding and common sense graph inference, we also design GLIP on Wheels (GoW) based on [50]. Instead of using graph reasoning as in MMSG, GoW consistently selects the nearest frontier during navigation, with a maximum distance of 1.8 m. It is worth noting that GoW shares the same navigation policy as MMSG, except for the frontier selection strategy.

Impact of scene understanding and graph inference. In Table 3, GLIP, based on open-world object detection, outperforms CoW on all metrics in ProcTHOR, indicating that GLIP is more suitable for multi-modal understanding. Additionally, MMSG outperforms GoW on all datasets and metrics, demonstrating that coarse-to-fine graph-based reasoning generalizes better than heuristic exploration.

Table 4 Comparison of different methods on the ProcTHOR dataset


Impact of different LLMs. To compare the inference performance of different LLMs on HM3D, we use GPT-3.5, LLaMA-7B, and LLaVA for inference. All three LLMs significantly improve performance over GoW. GPT performs similarly to LLaVA when using the room prompt without specific common sense training.

Impact of MMSG. As shown in Table 2, our MMSG adapts to changes in scene layouts and facilitates knowledge reasoning. The significantly improved SPL indicates that MMSG enhances navigation efficiency. Furthermore, to investigate the impact of MMSG on generalization, we train models with \(25\), \(50\), and \(75\%\) of MMSG nodes, i.e., randomly sampled subsets of MMSG nodes along with their corresponding objects. Results show that a higher proportion of MMSG nodes leads to better success rates and SPL.

Fig. 8

Visualization of the testing process. With the help of MMSG, the agent can successfully infer the invisible target object

After the training process, when utilizing Graph Attention Network (GAT) embedding weight matrices for navigation, our results demonstrate that the proposed Multi-Modal Scene Graph (MMSG) method exhibits superior robustness and faster extraction of key features. This can be attributed to the precise construction of our MMSG, in contrast to other approaches [10, 49] that rely solely on object detectors. Such detectors are prone to being affected by differences in appearance when transitioning from real to simulated environments. The effectiveness of our approach is further supported by the visualizations in Fig. 8, which highlight how the model integrates prior knowledge from the MMSG. As depicted in Fig. 8, during the search for "Dining Table," our agent concentrates more on the relatively larger area, leading it to make a precise localization.

4.6 Results and discussion

The quantitative results of the comparative study are presented in Table 4. As indicated by the findings, random walking without any specialized navigation policy results in failure in nearly all episodes. However, when leveraging the map-based framework to randomly sample the long-term goal, the performance surpasses even that of the classical frontier-based method [18]. This underscores the significant advantage of the map-based approach in enabling the robot to swiftly and roughly explore the environment. Additionally, the notable improvement achieved by SemExp [2] underscores the importance of semantic information in efficient exploration. PONI [34] further enhances performance while reducing computational costs, showcasing its ability to learn semantic priors in a distinct manner from other reinforcement learning approaches. Our framework consistently outperforms all baselines across both datasets, demonstrating notable enhancements over the SemExp [2] and PONI [34] baselines. The comparison between the feed-forward and zero-shot approaches suggests that the feed-forward method learns more precise relevance in large indoor scenes.

5 Conclusion

We presented MMSG, a novel multi-modal knowledge graph-based LLM visual navigation agent that applies large language models to facilitate visual navigation by examining two paradigms that infer semantic relevance from observed zones. Through experiments on the ProcTHOR dataset, we demonstrate that the MMSG agent can utilize LLMs for efficient visual navigation using only cameras. Our findings suggest that large language models hold immense potential in aiding robots in such tasks by providing useful knowledge. Future research should consider the design of the interaction between the robot and the LLM. This purely visual solution significantly reduces the production costs of traditional home service robots. It also contributes to urban sustainable development by conserving resources and protecting the environment, while providing humans with hygienic, safe, and barrier-free green smart home services.