DIAR: Diffusion-model-guided Implicit
Q-learning with Adaptive Revaluation

Jaehyun Park1  Yunho Kim1  Sejin Kim1  Byung-Jun Lee2  Sundong Kim1
1Gwangju Institute of Science and Technology   2Korea University
Abstract

We propose a novel offline reinforcement learning (offline RL) approach, introducing the Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation (DIAR) framework. We address two key challenges in offline RL: out-of-distribution samples and long-horizon problems. We leverage diffusion models to learn state-action sequence distributions and incorporate value functions for more balanced and adaptive decision-making. DIAR introduces an Adaptive Revaluation mechanism that dynamically adjusts decision lengths by comparing current and future state values, enabling flexible long-term decision-making. Furthermore, we address Q-value overestimation by combining Q-network learning with a value function guided by a diffusion model. The diffusion model generates diverse latent trajectories, enhancing policy robustness and generalization. As demonstrated in tasks like Maze2D, AntMaze, and Kitchen, DIAR consistently outperforms state-of-the-art algorithms in long-horizon, sparse-reward environments.

1 Introduction

Figure 1: Performance comparison across D4RL environments with long-horizon and sparse-reward tasks, specifically Maze2D. Panels: (a) Maze2D-umaze, (b) Maze2D-medium, (c) Maze2D-large. Our method, DIAR, consistently outperforms other diffusion-based planning frameworks, including Diffuser and LDCQ.

Offline reinforcement learning (offline RL) is a type of reinforcement learning where the agent learns a policy from pre-collected datasets, rather than collecting data through direct interactions with the environment (Fujimoto et al., 2019). Since offline RL does not involve learning in the real environment, it avoids safety-related issues. Additionally, offline RL can efficiently leverage the collected data, making it particularly useful when data collection is expensive or time-consuming. However, offline RL relies on the given dataset, so the learned policy may be inefficient or misdirected if the data is of poor quality or biased. Furthermore, a distributional shift may arise when learning from offline data (Levine et al., 2020), leading to degraded performance in the true environment.

To overcome the limitations of offline RL, prior work has leveraged diffusion models, a type of generative model (Janner et al., 2022). Incorporating diffusion models allows the agent to learn the overall distribution of the state and action spaces and to make decisions based on this knowledge. Methods such as Diffuser (Janner et al., 2022) and Decision Diffuser (DD) (Ajay et al., 2023) use diffusion models to predict decisions not autoregressively one step at a time, but by inferring the entire decision sequence over the horizon at once, achieving strong performance in long-horizon tasks. Additionally, methods like LDCQ (Venkatraman et al., 2024) propose using latent diffusion models to learn the Q-function, allowing the Q-function to make more appropriate predictions for out-of-distribution state-action pairs.

Diffusion-based offline RL methods often bypass the Q-function or rely on other offline Q-learning methods (Janner et al., 2022; Ajay et al., 2023). However, recent research has proposed approaches that do not avoid the Q-function but instead leverage diffusion models to assist Q-learning (Wang et al., 2023; Venkatraman et al., 2024). This enables handling a wide range of Q-values for diverse states and actions. We found that performance can be further enhanced by effectively handling the out-of-distribution data generated by the diffusion model.

Therefore, we propose Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation (DIAR), which integrates the value function into the training and decision process. This approach provides a more objective assessment of the current state, enabling the Q-function to achieve a balance between long-horizon decision-making and step-by-step refinement. In the training process, the Q-function and value function alternate between learning from the dataset and from samples generated by the diffusion model, allowing them to adapt to a wide variety of scenarios. Additionally, if the value of the current state is higher than the value after executing the action sequence, the agent can explore new action sequences.

DIAR consistently outperforms existing offline RL algorithms, especially in environments that involve complex route planning and long-horizon state-action sequences, such as those shown in Figure 2. Additionally, as shown in Figure 1, DIAR achieves state-of-the-art performance in environments such as Maze2D, AntMaze, and Kitchen (Fu et al., 2020). This research highlights the potential of diffusion models to enhance both policy abstraction and adaptability in offline RL, with significant implications for real-world applications in robotics and autonomous systems.

Figure 2: DIAR-generated trajectories in challenging Maze2D situations: (a) Maze2D-medium hard cases, (b) Maze2D-large hard cases. DIAR reliably reaches the goal even from starting points (blue) that are far from the goal (red). DIAR shows strong performance regardless of starting position.

2 Related Work

2.1 Offline Reinforcement Learning

Offline reinforcement learning (offline RL), also referred to as batch reinforcement learning, has gained significant attention in recent years due to its potential in learning effective policies from pre-collected datasets without further interaction with the environment. This paradigm is particularly useful in real-world applications where exploration can be costly or dangerous, such as healthcare, robotics (Kalashnikov et al., 2018), and autonomous driving.

One of the primary challenges in offline RL is the issue of out-of-distribution actions (Kumar et al., 2019), where a learned policy selects actions not well represented in the offline dataset. To address this, several works have introduced behavior regularization techniques that constrain the policy to remain close to the behavior policy seen in the offline data. Among these, Conservative Q-learning (CQL) introduces a conservative Q-function that underestimates the value of out-of-distribution actions, reducing the likelihood of the learned policy selecting potentially harmful actions (Kumar et al., 2020). By minimizing the overestimation of value functions, CQL facilitates more reliable policy learning from offline data. Another notable approach is Implicit Q-learning (IQL), which implicitly regularizes the learned Q-function by keeping it close to the empirical value of the actions observed in the dataset (Kostrikov et al., 2022). This prevents the over-optimization of Q-values for actions that are rarely or never observed in the offline dataset. Additionally, Batch-Constrained Q-learning (BCQ) imposes direct limitations on the learned policy to prevent deviations from the actions observed in the offline dataset (Fujimoto et al., 2019). BCQ introduces a constraint that ensures the learned policy selects actions similar to the behavior policy, thus avoiding the exploitation of inaccurate Q-value estimates for unseen actions.

2.2 Diffusion-based Planning in Offline RL

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) have shown remarkable performance in fields such as image inpainting (Lugmayr et al., 2022) and image generation (Ramesh et al., 2022; Saharia et al., 2022). Recent research has extended the application of diffusion models beyond image domains to address classical trajectory optimization challenges in offline RL. One prominent model, Diffuser (Janner et al., 2022), directly learns trajectory distributions and generates tailored trajectories based on situational demands. By prioritizing trajectory accuracy over single-step precision, Diffuser mitigates compounding errors and adapts to novel tasks or goals unseen during training. Additionally, Decision Diffuser (DD) was introduced, which predicts the next state using a state diffusion model and leverages inverse dynamics for decision-making (Ajay et al., 2023). Furthermore, a method called Latent Diffusion-Constrained Q-learning (LDCQ) has been proposed, which combines latent diffusion models with Q-learning to reduce extrapolation errors (Venkatraman et al., 2024). Emerging methods also focus on learning interpretable skills from visual and language inputs and applying conditional planning via diffusion models (Liang et al., 2024). Approaches that generate goal-divergent trajectories using Gaussian noise and facilitate reverse training through denoising processes have also been explored (Jain & Ravanbakhsh, 2023).

3 Preliminary: Latent Diffusion Reinforcement Learning

To train the Q-network, a diffusion model trained on latent representations is required. The first step is to learn how to represent an action-state sequence of length $H$ as a latent vector using a $\beta$-Variational Autoencoder ($\beta$-VAE) (Pertsch et al., 2021). The second step is to train the diffusion model using the latent vectors generated by the encoder of the $\beta$-VAE. This allows the diffusion model to learn the latent space corresponding to the action-state sequence. Subsequently, the Q-network is trained using the latent vectors generated by the diffusion model.

Latent representation by $\beta$-VAE

The $\beta$-VAE plays three key roles in the initial stage of our model training. First, the encoder $q_{\theta_E}(\bm{z}|\bm{s}_{t:t+H},\bm{a}_{t:t+H})$ must effectively represent the action-state sequence $\bm{s}_{t:t+H},\bm{a}_{t:t+H}$ from the dataset $\mathcal{D}$ as a latent vector $\bm{z}$. Second, the distribution of $\bm{z}$ generated by the $\beta$-VAE must be conditioned by the state prior $p_{\theta_s}(\bm{z}|\bm{s}_t)$. This is learned by minimizing the KL-divergence between the latent vector generated by the encoder and the one generated by the state prior. The formation of the latent vector is controlled by adjusting the $\beta$ value, which determines the weight of the KL-divergence. Lastly, the policy decoder $\pi_{\theta_D}(\bm{a}_t|\bm{s}_t,\bm{z})$ of the $\beta$-VAE must be able to accurately decode actions when given the current state and latent vector as inputs. These three objectives are combined to train the $\beta$-VAE by maximizing the evidence lower bound (ELBO) (Kingma & Welling, 2014) as shown in Eq. 1.

$$\mathcal{L}(\theta)=\mathbb{E}_{\mathcal{D}}\Bigl[\mathbb{E}_{q_{\theta_E}}\Bigl[\sum_{i=t}^{t+H-1}\log\pi_{\theta_D}(\bm{a}_i|\bm{s}_i,\bm{z})\Bigr]-\beta D_{KL}\bigl(q_{\theta_E}(\bm{z}|\bm{s}_{t:t+H},\bm{a}_{t:t+H})\parallel p_{\theta_s}(\bm{z}|\bm{s}_t)\bigr)\Bigr] \tag{1}$$
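As a rough illustration of Eq. 1, the sketch below evaluates the objective for a single trajectory segment. It assumes that both the encoder and the state prior output diagonal Gaussians (an assumption made here for the closed-form KL term, not something stated above); the function names `gaussian_kl` and `beta_vae_objective` are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dimensions.

    Assumption: encoder q and state prior p are diagonal Gaussians.
    """
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def beta_vae_objective(log_probs, mu_q, logvar_q, mu_p, logvar_p, beta):
    """Eq. 1 for one segment: summed decoder log-likelihoods minus beta * KL.

    log_probs[i] stands for log pi(a_i | s_i, z) at each of the H steps.
    The objective is maximized, so a training loss would be its negative.
    """
    return np.sum(log_probs) - beta * gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
```

A larger `beta` pulls the encoder's latent distribution closer to the state prior at the cost of reconstruction accuracy, which is the trade-off the paragraph above describes.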

Training latent vector with a diffusion model

The latent diffusion model (LDM) effectively learns latent representations, focusing on the latent space instead of the original data samples (Rombach et al., 2022). The model minimizes a loss function that predicts the initial latent $\bm{z}_t$ generated by the VAE encoder $q_\phi$, rather than the noise as in traditional diffusion models. $H$-length trajectory segments $\bm{s}_{t:t+H},\bm{a}_{t:t+H}$ are sampled from the dataset $\mathcal{D}$ and paired with initial states and latent variables $(\bm{s}_t,\bm{z}_t)$. The focus lies on modeling the prior $p(\bm{z}|\bm{s}_t)$ to capture the distribution of the latent $\bm{z}$ given the state $\bm{s}_t$. A conditional latent diffusion model $\mu_\psi(\bm{z}|\bm{s}_t)$ is utilized and refined with a time-dependent denoising function $\mu_\psi(\bm{z}^j,\bm{s}_t,j)$ to reconstruct $\bm{z}^0$ through the denoising step $j\sim[1,T]$. Consequently, the LDM is trained by minimizing the loss function $\mathcal{L}(\psi)$ as given in Eq. 2.

$$\mathcal{L}(\psi)=\mathbb{E}_{j\sim[1,T],\,(\bm{s},\bm{a})\sim\mathcal{D},\,\bm{z}_t\sim q_\phi(\bm{z}|\bm{s},\bm{a}),\,\bm{z}^j\sim\mu_\psi(\bm{z}^j|\bm{z}^0)}\bigl(\|\bm{z}_t-\mu_\psi(\bm{z}^j,\bm{s}_t,j)\|^2\bigr) \tag{2}$$
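A minimal sketch of a Monte Carlo estimate of Eq. 2 follows. The text does not specify the forward noising process, so the code assumes a standard variance-preserving schedule with linearly spaced noise levels; `make_alpha_bar`, `ldm_loss`, and the `denoiser` callable are hypothetical names introduced only for illustration.

```python
import numpy as np

def make_alpha_bar(T, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal level for each step 1..T (assumed linear noise schedule)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def ldm_loss(denoiser, z0, s_t, alpha_bar, rng):
    """Estimate Eq. 2 on a batch: the denoiser predicts the clean latent, not the noise.

    denoiser(z_j, s_t, j) is any callable returning an array shaped like z0.
    """
    batch, T = z0.shape[0], alpha_bar.shape[0]
    j = rng.integers(1, T + 1, size=batch)           # j ~ Uniform{1, ..., T}
    a = alpha_bar[j - 1][:, None]                    # per-sample signal level
    eps = rng.standard_normal(z0.shape)
    z_j = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps   # assumed forward noising
    pred = denoiser(z_j, s_t, j)
    return float(np.mean(np.sum((z0 - pred) ** 2, axis=1)))
```

With an oracle denoiser that always returns the clean latent the loss is zero, which is a quick sanity check that the target is the latent itself rather than the injected noise.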

Q-network by latent representation

To train the Q-network, Eq. 3 reduces extrapolation errors by restricting policy updates to the empirical distribution of the offline dataset (Venkatraman et al., 2024). Prioritizing trajectory accuracy over single-step precision allows the model to mitigate compounding errors and remain adaptable to novel tasks or goals unseen during training. Furthermore, the integration of temporal abstraction and latent space modeling notably enhances the mechanisms underlying credit assignment and improves the effectiveness of policy optimization.

$$Q(\bm{s}_t,\bm{z}_t)\leftarrow Q(\bm{s}_t,\bm{z}_t)+\alpha\Bigl[r_{t:t+H}+\gamma\max_{\bm{z}'_{t+H}\sim\mu_\psi}Q(\bm{s}_{t+H},\bm{z}'_{t+H})-Q(\bm{s}_t,\bm{z}_t)\Bigr] \tag{3}$$

The latent vector $\bm{z}'_{t+H}$ generated by the diffusion model is used in training the Q-function. As in Eq. 3, the Q-function learns the relation between $Q(\bm{s}_{t+H},\bm{z}'_{t+H})$ and $Q(\bm{s}_t,\bm{z}_t)$, where the initial state $\bm{s}_t$ and latent vector $\bm{z}_t$ pairs come from the dataset and $\bm{z}'_{t+H}$ is generated by the diffusion model. $r_{t:t+H}$ denotes the sum of rewards with discount factor $\gamma$. This enables the model to adapt to new tasks or goals that were not observed in the offline data. Once the Q-function has been trained according to Eq. 3, it is used for decision-making as shown in Eq. 4: given a state $\bm{s}_t$, the decision is made by selecting the latent vector with the highest Q-value and decoding it into an action.

$$\pi(\bm{s}_t)=\pi_\theta\Bigl(\bm{a}_t\,\Big|\,\operatorname*{arg\,max}_{\bm{z}_i\sim\mu_\psi(\bm{z}|\bm{s}_t)}Q(\bm{s}_t,\bm{z}_i)\Bigr) \tag{4}$$
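Eqs. 3 and 4 can be sketched together in a few lines. The code below is only an illustration under stated assumptions: $r_{t:t+H}$ is taken to be $\sum_{k} \gamma^{k} r_{t+k}$ over the segment, the max and argmax are approximated over a finite set of sampled latents, and the names `discounted_return`, `q_update`, `select_latent`, `q_fn`, and `sample_latents` are hypothetical.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Assumed form of r_{t:t+H}: sum over k of gamma^k * r_{t+k} along the segment."""
    return float(sum((gamma ** k) * r for k, r in enumerate(rewards)))

def q_update(q_sz, r_seg, next_q_candidates, alpha, gamma):
    """One application of Eq. 3 to a scalar Q(s_t, z_t).

    next_q_candidates holds Q(s_{t+H}, z') for latents z' sampled from the
    diffusion prior at s_{t+H}; the max runs over those samples only.
    The bootstrap term is discounted by gamma, as written in Eq. 3.
    """
    target = r_seg + gamma * float(np.max(next_q_candidates))
    return q_sz + alpha * (target - q_sz)

def select_latent(q_fn, sample_latents, s_t, n):
    """Eq. 4: draw n latents from the prior at s_t and keep the highest-Q one."""
    candidates = sample_latents(s_t, n)                      # shape (n, latent_dim)
    q_values = np.array([q_fn(s_t, z) for z in candidates])
    return candidates[int(np.argmax(q_values))]
```

The selected latent would then be passed to the policy decoder to produce actions. Restricting the max to latents drawn from the diffusion prior is what keeps the bootstrap target close to the support of the data.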

4 Proposed Method

Using diffusion models to address long-horizon tasks typically involves training over the full trajectory length (Janner et al., 2022). This approach differs from autoregressive methods that focus on selecting the best action at each step, as it learns the entire action sequence over the horizon. This allows the model to learn long sequences of decisions at once and generate a large number of actions in a single pass. However, predicting decisions over the entire horizon may not always lead to the optimal outcome, as it does not select the best action at each step.

Additionally, there is a well-known problem of overestimating the Q-value when training a Q-network (Hasselt et al., 2016; 2018; Fu et al., 2019; Kumar et al., 2019; Agarwal et al., 2020). This occurs when certain actions, appearing intermittently, are assigned a high $Q(\bm{s},\bm{a})$ value. In these cases, the state may not actually hold high value, but the Q-value becomes “lucky” and inflated. Therefore, it is essential to ensure that the Q-network does not overestimate and can correctly assess the value based on the current state.

To resolve both of these issues, we propose Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation (DIAR), introducing a value function to assess the value of each situation. Unlike the Q-network, which learns the value of both state and action, the state-value function learns only the value of the state. By introducing constraints from the value function, we can train a more balanced Q-network and, during the decision-making phase, make more optimal predictions with the help of the value function.

Figure 3: Three training stages of DIAR: (a) train $\beta$-VAE, (b) train latent diffusion model, (c) train Q-network and value-network. (a) The $\beta$-VAE is trained by encoding a state-action sequence spanning an $H$-length horizon into a latent space, followed by a policy decoder that outputs actions based on the encoded latent $\bm{z}$ and the state $\bm{s}_t$ contained within it. (b) A diffusion model is trained using the encoded latent and the initial state $\bm{s}_t$. (c) The Q-network is trained on the offline dataset, while the value network is trained on data generated by the diffusion model. This interplay allows the value function and Q-function to guide each other, enabling more balanced learning across both offline samples and generated data.

4.1 Diffusion-model-guided Q-learning Framework

The value-network $V_\eta$ with parameter $\eta$ evaluates the value of the current state $\bm{s}_t$, and the Q-network $Q_\phi$ with parameter $\phi$ evaluates the value of the current state $\bm{s}_t$ and action $\bm{a}_t$. Additionally, by combining value-network learning with Q-network learning, constraints can be applied to the Q-network, resulting in more balanced training. Furthermore, instead of merely relying on the dataset for value-network and Q-network learning, we incorporate latent vectors generated by the diffusion model. This reduces extrapolation error for new decisions not encountered in the dataset, allowing for more accurate value estimation.

The training of the value-network should aim to reduce the difference between the Q-value and the state-value. Therefore, it is crucial to include the difference between $Q(\bm{s},\bm{z})$ and $V(\bm{s})$ in the loss function. To achieve this, rather than simply using MSE loss, we apply weights so that the loss responds more sensitively to the sign of the difference. We use an asymmetric weighted loss function that weights the squared variable $u$ according to an expectile factor $\tau$, as shown in Eq. 5. In the next step, $u$ is set to the difference between the Q-value and the state-value for loss calculation.

$$L_\tau^2(u)=|\tau-\mathbb{I}(u<0)|\,u^2 \tag{5}$$
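Eq. 5 translates directly into code. The sketch below is a plain NumPy version (the function name `expectile_loss` is chosen here for illustration): residuals with $u \ge 0$ receive weight $\tau$ and residuals with $u < 0$ receive weight $1-\tau$.

```python
import numpy as np

def expectile_loss(u, tau):
    """Elementwise asymmetric squared loss of Eq. 5: |tau - 1(u < 0)| * u^2."""
    u = np.asarray(u, dtype=float)
    weight = np.abs(tau - (u < 0).astype(float))
    return weight * u ** 2
```

For example, with $\tau = 0.9$ a residual of $+1$ costs $0.9$ while a residual of $-1$ costs only $0.1$; with $\tau = 0.5$ the loss reduces to half the ordinary squared error.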

By using an asymmetrically weighted loss function, the value-network is trained to reduce the difference between the Q-value and the state-value. We set $\tau$ to a value greater than 0.5 and apply Eq. 6, which assigns more weight to samples where the Q-value exceeds the state-value ($u>0$) than to samples where it falls below. Additionally, instead of using latent vectors encoded from the dataset, we use latent vectors $\tilde{\bm{z}}_t$ generated by the diffusion model to guide the learning of a more generalized Q-network.

$$L_V(\eta)=\mathbb{E}_{\bm{s}_t\sim\mathcal{D},\ \tilde{\bm{z}}_t\sim\mu_\psi}\Bigl[L_\tau^2\bigl(Q_{\hat{\phi}}(\bm{s}_t,\tilde{\bm{z}}_t)-V_\eta(\bm{s}_t)\bigr)\Bigr] \tag{6}$$

After the loss for the value-network is calculated, the loss for the Q-network is computed. The loss in Eq. 7 does not bootstrap from the Q-network alone; its target is built from the reward and the value-network's estimate of the next state, ensuring balance with the value network. The value-network, trained with latent vectors generated by the diffusion model, can handle diverse trajectories, while the Q-network is trained on tuples $(\bm{s}_t,\bm{z}_t,r_{t:t+H},\bm{s}_{t+H})\sim\mathcal{D}$ from the dataset, learning the Q-value of state-latent vector pairs based on existing trajectories. The Q-network and value-network training processes thus form a complementary relationship.

$$L_Q(\phi)=\mathbb{E}_{(\bm{s}_t,\bm{z}_t,r_{t:t+H},\bm{s}_{t+H})\sim\mathcal{D}}\Bigl[\bigl(r_{t:t+H}+\gamma V_\eta(\bm{s}_{t+H})-Q_\phi(\bm{s}_t,\bm{z}_t)\bigr)^2\Bigr] \tag{7}$$
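To make the interplay between Eq. 6 and Eq. 7 concrete, the following sketch computes both losses on a batch of precomputed network outputs. It is not the authors' implementation: the arrays stand in for network evaluations, and the names `value_loss` and `q_loss` are hypothetical.

```python
import numpy as np

def value_loss(q_target, v, tau):
    """Eq. 6: expectile regression of V(s_t) toward target-network Q-values.

    q_target[i] stands for Q_target(s_t, z~_t) at a diffusion-generated latent,
    and v[i] for V(s_t) on the same state.
    """
    u = q_target - v
    weight = np.abs(tau - (u < 0).astype(float))
    return float(np.mean(weight * u ** 2))

def q_loss(r_seg, v_next, q, gamma):
    """Eq. 7: squared error against the target r_{t:t+H} + gamma * V(s_{t+H}).

    All inputs come from dataset tuples (s_t, z_t, r_{t:t+H}, s_{t+H}).
    """
    return float(np.mean((r_seg + gamma * v_next - q) ** 2))
```

In a training loop the two losses would be minimized alternately, with gradients of `value_loss` flowing only into the value-network and gradients of `q_loss` only into the Q-network. Because the Q target uses $V_\eta$ instead of a max over Q-values, the Q-network never queries itself on unseen latents.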

To ensure stable Q-network training and prevent Q-value overestimation, we employ Clipped Double Q-learning (Fujimoto et al., 2018). Additionally, we use a prioritized replay buffer $\mathcal{B}$, from which samples are drawn for Q-network training according to their priority (Schaul et al., 2016). $\mathcal{B}$ stores tuples $(\bm{s}_{t},\bm{z}_{t},r_{t:t+H},\bm{s}_{t+H})$ generated from the offline dataset. The state, action, and reward are taken from the offline dataset, and the latent vector $\bm{z}_{t}$ is encoded by $q_{\theta_{E}}(\bm{z}|\bm{s},\bm{a})$. The encoded latent vector $\bm{z}_{t}$, along with the current state $\bm{s}_{t}$, is used to guide the MLP model through the diffusion model to learn the Q-value. The Q-network and value-network are trained alternately, maintaining a complementary relationship through their respective loss functions. The value-network’s loss $L_{V}(\eta)$ is calculated based on the difference between the Q-value and the state-value, which is adjusted by the expectile factor $\tau$. The Q-network’s loss $L_{Q}(\phi)$ is computed using the Bellman equation with the reward and value, where the effect of distant timesteps is controlled by the discount factor $\gamma$. The parameters of $Q_{\phi}$ are updated by backpropagating $L_{Q}(\phi)$, and the target Q-network $Q_{\hat{\phi}}$ is gradually updated toward $Q_{\phi}$ at the update rate $\rho$. The detailed process can be found in Algorithm 1.
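As a rough illustration of the two objectives described above, the following sketch evaluates both losses on scalar inputs. It assumes the IQL-style expectile loss $L_{2}^{\tau}(u)=|\tau-\mathbb{1}(u<0)|\,u^{2}$ for the value-network, which this section does not spell out, and it assumes that the value loss is averaged over Q-values of several sampled latents. The function names and any numeric inputs are hypothetical.

```python
from typing import Sequence


def expectile_loss(u: float, tau: float) -> float:
    """Asymmetric squared error |tau - 1(u < 0)| * u**2 (IQL-style, assumed here)."""
    weight = (1.0 - tau) if u < 0 else tau
    return weight * u * u


def value_loss(q_values: Sequence[float], v: float, tau: float) -> float:
    """Mean expectile loss between Q(s, z_k) for sampled latents z_k and V(s)."""
    return sum(expectile_loss(q - v, tau) for q in q_values) / len(q_values)


def q_loss(reward_sum: float, gamma: float, v_next: float, q: float) -> float:
    """Squared Bellman error of Eq. 7 for a single stored tuple."""
    target = reward_sum + gamma * v_next
    return (target - q) ** 2
```

With $\tau=0.9$, a Q-value above the current state-value is penalized nine times more heavily than one below it, which pushes $V_{\eta}$ toward the upper range of the Q-values it sees.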

Input: Q-network $Q_{\phi}$, target Q-network $Q_{\hat{\phi}}$, value-network $V_{\eta}$, diffusion model $\mu_{\psi}(\bm{z}|\bm{s})$, prioritized replay buffer $\mathcal{B}$, horizon $H$, number of sampling latent vectors $n$, latent vector $\bm{z}$, update rate $\rho$, max iteration $T$, learning rates $\lambda_{Q},\lambda_{V}$

$\hat{\phi}\leftarrow\phi$
$t\leftarrow 0$
while $t<T$ do
      $(\bm{s}_{t},\bm{z}_{t},r_{t:t+H},\bm{s}_{t+H})\leftarrow\mathcal{B}$
      $\bm{z}^{0}_{t+H},\bm{z}^{1}_{t+H},\ldots,\bm{z}^{n-1}_{t+H}\leftarrow\mu_{\psi}(\bm{z}|\bm{s}_{t+H})$ # Sampling $n$ latent vectors
      $\eta\leftarrow\eta-\lambda_{V}\nabla_{\eta}L_{V}(\eta)$ # Training value-network
      $\phi\leftarrow\phi-\lambda_{Q}\nabla_{\phi}L_{Q}(\phi)$ # Training Q-network
      $\hat{\phi}\leftarrow\rho\phi+(1-\rho)\hat{\phi}$
      Update priority of $\mathcal{B}$
end while
Algorithm 1 Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation
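The last two steps of Algorithm 1 can be sketched as follows. This is a minimal illustration on plain Python lists, assuming proportional prioritization with a priority of the form $|\delta|+\epsilon$ in the spirit of Schaul et al. (2016). The helper names and the constant `eps` are hypothetical and are not taken from the authors' code.

```python
import random
from typing import List, Sequence


def soft_update(phi_hat: Sequence[float], phi: Sequence[float], rho: float) -> List[float]:
    """Target update: phi_hat <- rho * phi + (1 - rho) * phi_hat."""
    return [rho * p + (1.0 - rho) * ph for p, ph in zip(phi, phi_hat)]


def sample_index(priorities: Sequence[float], rng: random.Random) -> int:
    """Draw a buffer index with probability proportional to its priority."""
    return rng.choices(range(len(priorities)), weights=priorities, k=1)[0]


def updated_priority(td_error: float, eps: float = 1e-6) -> float:
    """New priority after a training step (assumed form: |TD error| + eps)."""
    return abs(td_error) + eps
```

A small $\rho$ keeps the target network close to its previous parameters, so the regression target in Eq. 7 changes slowly between iterations.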

4.2 Adaptive Revaluation in Policy Execution

DIAR revises a decision if the value of the current state is higher than the value of the state reached after following that decision over the horizon length $H$. We refer to this process as Adaptive Revaluation. Using the value-network $V_{\eta}$, if the current state’s value $V(\bm{s}_{t})$ is greater than $V(\bm{s}_{t+H})$, the value after following the decision for $H$ steps, the method generates a new latent vector $\bm{z}_{t}$ from the current state $\bm{s}_{t}$ and continues the decision-making process. When predicting over the horizon length, there may be cases where taking a different action midway through the horizon would be better. The value-network $V_{\eta}$ checks for such cases, and if the condition is met, a new latent vector is generated.

Figure 4: Inference step with DIAR. The current state $\bm{s}_{t}$ is fed into the diffusion model to extract candidate latent vectors. Then, the latent vector $\bm{z}_{t}$ with the highest $Q(\bm{s}_{t},\bm{z}_{t})$ is selected as the best latent vector. This latent vector $\bm{z}_{t}$ is subsequently decoded to generate the action $\bm{a}_{t}$. Additionally, the future state $\bm{s}_{t+H}$ is also decoded to be used for calculating the future value $V(\bm{s}_{t+H})$.
(a) Non-ideal values for states within a latent in the sparse-reward environment
(b) Ideal values for states within a latent in the sparse-reward environment
Figure 5: The process of finding a better trajectory using Adaptive Revaluation. The agent makes a decision and acts based on the skill latent $\bm{z}_{t}$ with the highest $Q(\bm{s}_{t},\bm{z}_{t})$. The current latent vector $\bm{z}_{t}$ is used to predict the future state $\bm{s}_{t+H}$, from which the value $V(\bm{s}_{t+H})$ is calculated. (a) If the value $V(\bm{s}_{t})$ of the current state $\bm{s}_{t}$ is greater than the value $V(\bm{s}_{t+H})$ of the future state $\bm{s}_{t+H}$, the trajectory is considered non-ideal, and re-sampling is performed. (b) If the value $V(\bm{s}_{t+H})$ of the future state is greater than or equal to the value $V(\bm{s}_{t})$ of the current state, the trajectory is considered ideal, and the action $\bm{a}_{t}$ decoded from the latent vector $\bm{z}_{t}$ continues to be executed.

Adaptive Revaluation uses the difference in value to examine whether the agent’s predicted decision is optimal. Since the current state $\bm{s}_{t}$ can be obtained directly from the environment, its value $V(\bm{s}_{t})$ is easy to compute. Whether the current trajectory is optimal can be determined using a state decoder $f_{\theta}(\bm{s}_{t+H}|\bm{s}_{t},\bm{z}_{t})$. Given the current state $\bm{s}_{t}$ and latent vector $\bm{z}_{t}$, the state decoder predicts the future state $\bm{s}_{t+H}$. This predicted $\bm{s}_{t+H}$ is passed into the value-network $V_{\eta}$ to estimate its future value $V(\bm{s}_{t+H})$. If, on comparing these two values, the current value $V(\bm{s}_{t})$ is higher, the agent generates new latent vectors and selects the one with the highest $Q(\bm{s}_{t},\bm{z}_{t})$. The detailed Adaptive Revaluation algorithm is shown in Appendix B.
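The check described in this subsection can be sketched in a few lines. This is only an approximation based on the description here, not the authors' Appendix B algorithm: the diffusion sampler, Q-network, value-network, and state decoder are passed in as plain callables, states and latents are reduced to scalars for readability, and all names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

State = float   # toy 1-D state, for illustration only
Latent = float  # toy 1-D latent, for illustration only


def select_latent(s: State,
                  sample_latents: Callable[[State], Sequence[Latent]],
                  q_fn: Callable[[State, Latent], float]) -> Latent:
    """Pick the candidate latent with the highest Q(s, z)."""
    return max(sample_latents(s), key=lambda z: q_fn(s, z))


def adaptive_revaluation(s: State,
                         z: Latent,
                         sample_latents: Callable[[State], Sequence[Latent]],
                         q_fn: Callable[[State, Latent], float],
                         v_fn: Callable[[State], float],
                         state_decoder: Callable[[State, Latent], State]) -> Tuple[Latent, bool]:
    """Return (latent to execute, whether it was re-sampled)."""
    s_future = state_decoder(s, z)   # predicted s_{t+H}
    if v_fn(s) > v_fn(s_future):     # current value exceeds predicted future value
        return select_latent(s, sample_latents, q_fn), True
    return z, False                  # keep executing the current latent
```

In a toy setting where the value decays with distance to a goal, a latent that moves the agent away from the goal triggers re-sampling, while a latent that moves toward it is kept.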

4.3 Theoretical Analysis of DIAR

In this section, we prove that in the sparse-reward case, if at the current timestep $t$ the value $V(\bm{s}_{t})$ of the current state $\bm{s}_{t}$ is higher than the value $V(\bm{s}_{t+H})$ of the future state $\bm{s}_{t+H}$, then there is a more ideal trajectory than the current one. An ideal trajectory is defined as one where, for every timestep $k$ and a discount factor $0<\gamma\leq 1$, $V(\bm{s}_{k})\leq V(\bm{s}_{k+1})$. This means that for an agent performing actions toward a goal, the value of each state in the trajectory increases monotonically.

Now, consider the following assumption about an ideal trajectory, which we will show leads to a contradiction: for timesteps $i,j$ with $i<j$, we assume that $V(\bm{s}_{i})>V(\bm{s}_{j})$ for $\bm{s}_{i}$ and $\bm{s}_{j}$ from the dataset $\mathcal{D}$. Furthermore, since the state $\bm{s}_{j}$ is not the goal and we are in a sparse-reward setting, every reward along the way is zero, i.e., $r(\bm{s}_{i},\bm{a}_{i})=0$. Writing the Bellman equation for the value function gives Eq. 8.

$$V(\bm{s}_{i})=\mathbb{E}_{(\bm{s}_{i},\bm{a}_{i},\bm{s}_{i+1})\sim\mathcal{D}}\left[r(\bm{s}_{i},\bm{a}_{i})+\gamma V(\bm{s}_{i+1})\right] \tag{8}$$

Eq. 8 expresses the value function $V(\bm{s}_{i})$ over a single timestep: it can be computed from the reward received for the action taken in the current state $\bm{s}_{i}$ and the value of the next state $\bm{s}_{i+1}$. Therefore, by iterating Eq. 8 over the timesteps from $i$ to $j$, we obtain Eq. 9.

$$V(\bm{s}_{i})=\mathbb{E}_{(\bm{s}_{i:j},\bm{a}_{i:j})\sim\mathcal{D}}\left[\sum_{t=i}^{j-1}\gamma^{t-i}r(\bm{s}_{t},\bm{a}_{t})+\gamma^{j-i}V(\bm{s}_{j})\right] \tag{9}$$

Since the current environment is sparse in rewards, no reward is given if the goal is not reached. Therefore, in Eq. 9, all reward terms $r(\bm{s}_{t},\bm{a}_{t})$ are zero. By substituting zero for the rewards and reorganizing Eq. 9, we can derive Eq. 10.

$$V(\bm{s}_{i})=\mathbb{E}_{(\bm{s}_{i:j},\bm{a}_{i:j})\sim\mathcal{D}}\left[\gamma^{j-i}V(\bm{s}_{j})\right] \tag{10}$$

Since $0<\gamma\leq 1$, the term $\gamma^{j-i}V(\bm{s}_{j})$ is always less than or equal to $V(\bm{s}_{j})$, provided that values are non-negative, as is the case when the only reward is a non-negative goal reward. By Eq. 10, $V(\bm{s}_{i})$ therefore cannot exceed $V(\bm{s}_{j})$. This contradicts the initial assumption $V(\bm{s}_{i})>V(\bm{s}_{j})$, indicating that the assumption is incorrect. Therefore, for any ideal trajectory, the values $V(\bm{s}_{i})$ must be monotonically non-decreasing along the trajectory. In other words, if the trajectory predicted by the agent is an ideal trajectory, the value $V(\bm{s}_{j})$ after making a decision over the horizon $H$ must be at least as large as the current value $V(\bm{s}_{i})$. If the current value $V(\bm{s}_{i})$ is greater than the future value $V(\bm{s}_{j})$, then this trajectory is not an ideal trajectory. Consequently, generating a new latent vector $\bm{z}_{i}$ from the current state $\bm{s}_{i}$ to search for a better decision is the preferable approach.
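To make Eq. 10 concrete, the sketch below evaluates $V(\bm{s}_{k})=\gamma^{T-k}R$ along a deterministic trajectory that reaches the goal at step $T$ with a single terminal reward $R$. The values $\gamma=0.99$, $T=30$, and $R=1$ are illustrative assumptions rather than settings reported in this paper, and the function names are hypothetical.

```python
from typing import List


def sparse_values(gamma: float, steps_to_goal: int, goal_reward: float = 1.0) -> List[float]:
    """V(s_k) = gamma**(T - k) * R for k = 0..T on a deterministic goal-reaching path."""
    return [goal_reward * gamma ** (steps_to_goal - k) for k in range(steps_to_goal + 1)]


def is_ideal(values: List[float]) -> bool:
    """True when V(s_k) <= V(s_{k+1}) holds for every consecutive pair."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Values rise toward the goal, so the ideal-trajectory condition holds.
values = sparse_values(gamma=0.99, steps_to_goal=30)
print(round(values[0], 4), values[-1], is_ideal(values))
```

A sequence whose value drops, such as `[0.8, 0.7]`, fails `is_ideal`, which corresponds to the condition under which Adaptive Revaluation re-samples a latent vector.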

5 Experiments

We compare the performance of our model with other models under various conditions and environments. We focus on goal-based tasks in environments with long horizons and sparse rewards. For offline RL, we use the Maze2D, AntMaze, and Kitchen datasets to test the strengths of our model in long-horizon sparse-reward settings (Fu et al., 2020). These environments feature very long trajectories in their datasets, and rewards are only given upon reaching the goal, making them highly suitable for evaluating our model. We also measure the performance improvement achieved with Adaptive Revaluation, analyzing whether it allows incorrect decisions to be reconsidered and enables the generation of the correct trajectory. Furthermore, to ensure more accurate performance measurements, all scores are averaged over 100 runs, repeated 5 times, and reported as mean and standard deviation.
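The reporting protocol can be sketched as below, assuming that each of the 5 repetitions yields one score averaged over its 100 runs and that the reported spread is the sample standard deviation across repetitions. The text does not state which estimator is used, and the function name is hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores_per_repeat: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Average each repetition's runs, then report mean and std across repetitions."""
    repeat_means = [mean(runs) for runs in scores_per_repeat]
    return mean(repeat_means), stdev(repeat_means)
```

Here `scores_per_repeat` would hold 5 inner sequences of 100 normalized scores each.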

5.1 Performance on Offline RL Benchmarks

In this section, we compare the performance of our model in offline RL. To evaluate our model, we compare it against various state-of-the-art models. These include behavior cloning (BC), which imitates the dataset, and offline RL methods based on Q-learning, such as IQL (Kostrikov et al., 2022) and IDQL (Hansen-Estruch et al., 2023). We also compare our model with DT (Chen et al., 2021), which uses the transformer architecture employed in LLMs, and methods that use diffusion models, such as Diffuser (Janner et al., 2022), DD (Ajay et al., 2023), and LDCQ (Venkatraman et al., 2024). Through these comparisons with various algorithms, we conduct a quantitative performance evaluation of our model.

Table 1: Comparison with other methods in long horizon sparse reward D4RL environments.
| Dataset | BC | IQL | DT | IDQL | Diffuser | DD | LDCQ | DIAR |
|---|---|---|---|---|---|---|---|---|
| maze2d-umaze-v1 | 3.8 | 47.4 | 27.3 | 57.9 | 113.5 | - | 134.2 | 141.8 ± 4.3 |
| maze2d-medium-v1 | 30.3 | 34.9 | 32.1 | 89.5 | 121.5 | - | 125.3 | 139.2 ± 3.5 |
| maze2d-large-v1 | 5.0 | 58.6 | 18.1 | 90.1 | 123.0 | - | 150.1 | 200.3 ± 3.4 |
| antmaze-umaze-diverse-v2 | 45.6 | 62.2 | 54.0 | 62.0 | - | - | 81.4 | 88.8 ± 1.5 |
| antmaze-medium-diverse-v2 | 0.0 | 70.0 | 0.0 | 83.5 | 45.5 | 24.6 | 68.9 | 68.2 ± 6.7 |
| antmaze-large-diverse-v2 | 0.0 | 47.5 | 0.0 | 56.4 | 22.0 | 7.5 | 57.7 | 60.6 ± 2.4 |
| kitchen-complete-v0 | 65.0 | 62.5 | - | - | - | - | 62.5 | 68.8 ± 2.1 |
| kitchen-partial-v0 | 38.0 | 46.3 | 42.0 | - | - | 57.0 | 67.8 | 63.3 ± 0.9 |
| kitchen-mixed-v0 | 51.5 | 51.0 | 50.7 | - | - | 65.0 | 62.3 | 60.8 ± 1.4 |

Datasets like Maze2D and AntMaze require the agent to learn how to navigate from a random starting point to a random location. Simply mimicking the dataset is insufficient for achieving good performance. The agent must learn what constitutes a good decision and how to make the best judgments throughout the trajectory. Additionally, the ability to stitch together multiple paths through trajectory combinations is essential. In particular, the AntMaze dataset involves a complex state space and requires learning and understanding high-dimensional policies. We observed that our method, DIAR, consistently demonstrated strong performance in these challenging tasks, where high-dimensional abstraction and reasoning are critical. For more demonstrations, please refer to Appendix F.

5.2 Impact of Adaptive Revaluation

In this section, we analyze the impact of Adaptive Revaluation. We directly compare our model with and without Adaptive Revaluation. The test is conducted on long-horizon sparse-reward tasks. For overall training, an expectile value of $\tau=0.9$ was used, with $H=30$ for Maze2D and $H=20$ for AntMaze and Kitchen. Other training settings were generally the same, and detailed configurations can be found in Appendix A.

(a) umaze w/o AR
(b) medium w/o AR
(c) large w/o AR
(d) umaze w/ AR
(e) medium w/ AR
(f) large w/ AR
Figure 6: (a)-(c) Maze2D results using only the Q-function, without Adaptive Revaluation. (d)-(f) Maze2D results with improved decision-making using Adaptive Revaluation. Even without Adaptive Revaluation, our model performs well, but using Adaptive Revaluation enables more efficient decision-making.

When Adaptive Revaluation is used, it checks whether a better decision might exist according to the value function and discovers a better latent vector to re-create the trajectory. If the value of the current state is higher than the value of a future state, it indicates that a better trajectory might exist than the currently selected decision. This enables the agent to choose a more accurate abstraction and form a more optimal trajectory based on it. The improvement in decision-making with Adaptive Revaluation can be observed in Table 2, which shows how much the agent’s decisions improve when using this method.

Table 2: Comparison of performance changes with Adaptive Revaluation (AR) in D4RL tasks.
| Dataset | DIAR w/o AR | DIAR w/ AR |
|---|---|---|
| maze2d-umaze-v1 | 135.6 ± 2.8 | 141.8 ± 4.3 |
| maze2d-medium-v1 | 138.2 ± 3.1 | 139.2 ± 3.5 |
| maze2d-large-v1 | 193.5 ± 4.7 | 200.3 ± 3.4 |
| antmaze-umaze-diverse-v2 | 88.8 ± 1.5 | 85.4 ± 2.6 |
| antmaze-medium-diverse-v2 | 68.2 ± 6.7 | 67.4 ± 3.4 |
| antmaze-large-diverse-v2 | 56.0 ± 4.6 | 60.6 ± 2.4 |
| kitchen-complete-v0 | 68.8 ± 2.1 | 63.8 ± 3.0 |
| kitchen-partial-v0 | 63.3 ± 0.9 | 63.0 ± 2.5 |
| kitchen-mixed-v0 | 60.0 ± 0.7 | 60.8 ± 1.4 |

5.3 Comparison with Skill Latent Models

We further compare our model with other reinforcement learning methods that use skill latents. For the D4RL tasks, we selected methods that use generative models to learn skills and make decisions based on them. As performance baselines, we chose the VAE-based methods OPAL (Ajay et al., 2021) and PLAS (Zhou et al., 2020), as well as Flow2Control (Yang et al., 2023), which utilizes normalizing flows. For OPAL, to compare its effect on implicit learning, we refer to the results from Yang et al. (2023). The performance comparison is shown in Table 3.

Table 3: Performance comparison with other skill latent learning methods in D4RL tasks.
| Dataset | BC | PLAS | IQL+OPAL | Flow2Control | DIAR |
|---|---|---|---|---|---|
| maze2d-umaze-v1 | 3.8 | 57.0 | - | - | 141.8 ± 4.3 |
| maze2d-medium-v1 | 30.3 | 36.5 | - | - | 139.2 ± 3.5 |
| maze2d-large-v1 | 5.0 | 122.7 | - | - | 200.3 ± 3.4 |
| antmaze-umaze-diverse-v2 | 45.6 | 45.3 | 70.2 | 81.6 | 88.8 ± 1.5 |
| antmaze-medium-diverse-v2 | 0.0 | 0.7 | 42.8 | 83.7 | 68.2 ± 6.6 |
| antmaze-large-diverse-v2 | 0.0 | 0.0 | 52.4 | 52.8 | 60.6 ± 2.4 |
| kitchen-complete-v0 | 65.0 | 34.8 | 11.5 | 75.0 | 68.8 ± 2.1 |
| kitchen-partial-v0 | 38.0 | 43.9 | 72.5 | 74.9 | 63.3 ± 0.9 |
| kitchen-mixed-v0 | 51.5 | 40.8 | 65.7 | 69.2 | 60.8 ± 1.4 |

6 Conclusion

In this study, we proposed Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation (DIAR), which leverages diffusion models to improve abstraction capabilities and train more adaptive agents in offline RL. First, we introduced an Adaptive Revaluation algorithm based on the value function, which allows for long-horizon predictions while enabling the agent to flexibly revise its decisions to discover better ones. Second, we proposed Diffusion-model-guided Implicit Q-learning. Offline RL has difficulty evaluating out-of-distribution state-action pairs, as it learns from a fixed dataset. By leveraging the diffusion model, a generative model, we balance the learning of the value function and Q-function to cover a broader range of cases. By combining these two methods, we achieved state-of-the-art performance in long-horizon sparse-reward tasks such as Maze2D, AntMaze, and Kitchen. Our approach is particularly strong in long-horizon sparse-reward situations, where it is challenging to assess the current value. Additionally, a key advantage of our method is that it performs well without requiring extensive hyper-parameter tuning for each task. We believe that the latent diffusion model holds significant strengths in offline RL and has high potential for applications in various fields such as robotics.

References

  • Agarwal et al. (2020) Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An Optimistic Perspective on Offline Reinforcement Learning. In ICML, 2020.
  • Ajay et al. (2021) Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning. In ICLR, 2021.
  • Ajay et al. (2023) Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is Conditional Generative Modeling All You Need for Decision-Making? In ICLR, 2023.
  • Chen et al. (2021) Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement Learning via Sequence Modeling. In NeurIPS, 2021.
  • Fu et al. (2019) Justin Fu, Aviral Kumar, Matthew Soh, and Sergey Levine. Diagnosing Bottlenecks in Deep Q-learning Algorithms. arXiv:1902.10250, 2019.
  • Fu et al. (2020) Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv:2004.07219, 2020.
  • Fujimoto et al. (2018) Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. In ICML, 2018.
  • Fujimoto et al. (2019) Scott Fujimoto, David Meger, and Doina Precup. Off-Policy Deep Reinforcement Learning without Exploration. In ICML, 2019.
  • Hansen-Estruch et al. (2023) Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies. arXiv:2304.10573, 2023.
  • Hasselt et al. (2016) Hado van Hasselt, Arthur Guez, and David Silver. Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.
  • Hasselt et al. (2018) Hado van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep Reinforcement Learning and the Deadly Triad. arXiv:1812.02648, 2018.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In NeurIPS, 2020.
  • Jain & Ravanbakhsh (2023) Vineet Jain and Siamak Ravanbakhsh. Learning to Reach Goals via Diffusion. arXiv:2310.02505, 2023.
  • Janner et al. (2022) Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with Diffusion for Flexible Behavior Synthesis. In ICML, 2022.
  • Kalashnikov et al. (2018) Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. In CoRL, 2018.
  • Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
  • Kostrikov et al. (2022) Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning. In ICLR, 2022.
  • Kumar et al. (2019) Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. In NeurIPS, 2019.
  • Kumar et al. (2020) Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-Learning for Offline Reinforcement Learning. In NeurIPS, 2020.
  • Levine et al. (2020) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv:2005.01643, 2020.
  • Liang et al. (2024) Zhixuan Liang, Yao Mu, Hengbo Ma, Masayoshi Tomizuka, Mingyu Ding, and Ping Luo. SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution. In CVPR, 2024.
  • Lugmayr et al. (2022) Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using Denoising Diffusion Probabilistic Models. In CVPR, 2022.
  • Pertsch et al. (2021) Karl Pertsch, Youngwoon Lee, and Joseph Lim. Accelerating Reinforcement Learning with Learned Skill Priors. In CoRL, 2021.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. In NeurIPS, 2022.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In NeurIPS, 2022.
  • Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized Experience Replay. In ICLR, 2016.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In ICML, 2015.
  • Venkatraman et al. (2024) Siddarth Venkatraman, Shivesh Khaitan, Ravi Tej Akella, John Dolan, Jeff Schneider, and Glen Berseth. Reasoning with Latent Diffusion in Offline Reinforcement Learning. In ICLR, 2024.
  • Wang et al. (2023) Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning. In ICLR, 2023.
  • Yang et al. (2023) Yiqin Yang, Hao Hu, Wenzhe Li, Siyuan Li, Jun Yang, Qianchuan Zhao, and Chongjie Zhang. Flow to Control: Offline Reinforcement Learning with Lossless Primitive Discovery. In AAAI, 2023.
  • Zhou et al. (2020) Wenxuan Zhou, Sujay Bajracharya, and David Held. PLAS: Latent Action Space for Offline Reinforcement Learning. In CoRL, 2020.

Appendix A Experimental Details

DIAR consists of three main components: the β-VAE for learning latent skills, the latent diffusion model for learning the distribution of latent vectors, and the Q-function, which learns the value of state-latent vector pairs and selects the best latent. These three models are trained sequentially, and when learning the same task, the earlier models can be reused. Detailed model settings and hyperparameters are given in the following subsections. For further implementation details, refer to the code on GitHub.

A.1 β-Variational Autoencoder

The β-VAE consists of an encoder, policy decoder, state prior, and state decoder. The encoder uses two stacked bidirectional GRUs. Each GRU output is passed through an MLP to compute the mean and standard deviation, which are then used to compute the latent vector. This latent vector is used by the state prior, state decoder, and policy decoder. The policy decoder takes the latent vector and the current state as input to predict the current action. The state decoder takes the latent vector and the current state to predict the future state. Lastly, the state prior learns the distribution of the latent vector given the current state, and the latent distribution produced by the encoder is regularized toward it through KL divergence.
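As a minimal sketch of how a latent vector can be computed from a mean and standard deviation, the snippet below uses the reparameterization trick of Kingma & Welling (2014). That DIAR's encoder samples in exactly this way is an assumption; the function name `sample_latent` and the example values are hypothetical, and only the latent dimension of 16 comes from Table 4.

```python
import random

def sample_latent(mean, std, rng):
    """Reparameterized draw z = mean + std * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

rng = random.Random(0)
latent_dim = 16                 # latent dimension from Table 4
mean = [0.5] * latent_dim       # illustrative encoder mean (assumption)
std = [0.1] * latent_dim        # illustrative encoder std (assumption)
z = sample_latent(mean, std, rng)
```

Writing the draw as a deterministic function of the mean, the standard deviation, and independent noise is what lets gradients reach the encoder outputs during training.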

In Maze2D, $H=30$ is used; in AntMaze and Kitchen, $H=20$ is used. The diffusion model for the diffusion prior used in β-VAE training employs a transformer architecture. This model differs from the latent diffusion model discussed in the next section, and they are trained independently. Training the β-VAE for too many epochs can lead to overfitting of the latent vector, which can negatively impact the next stage.

Table 4: Hyperparameters for VAE training
Hyperparameter Value
Learning rate 5e-5
Batch size 128
Epochs 100
Latent dimension 16
β 0.1
Diffusion prior steps 200
Optimizer Adam

A.2 Latent Diffusion Model

The latent diffusion model learns the distribution of the latent vector conditioned on the current state. The current state and latent vector are concatenated and then re-encoded for use. The architecture of the diffusion model follows a U-Net structure, where the dimensionality decreases and then increases, with each block consisting of residual blocks. Unlike the traditional approach of predicting the noise ε, the diffusion model is trained to directly predict the latent vector z. The training loss is weighted using Min-SNR-γ. Otherwise, the diffusion model operates similarly to the DDPM method.
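The Min-SNR-γ weighting can be illustrated with a short sketch. It computes the per-step weight min{SNR(j), γ} that appears in the training algorithms of Appendices C and D, using SNR(j) = ᾱ_j / (1 − ᾱ_j). The linear variance schedule and its endpoints (1e-4 to 2e-2) are assumptions made for illustration, since this appendix does not state which schedule is used; the function names are hypothetical.

```python
def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_j = product over s <= j of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def min_snr_weight(alpha_bar, gamma=5.0):
    """Loss weight min{SNR(j), gamma} for a model that predicts z directly."""
    snr = alpha_bar / (1.0 - alpha_bar)
    return min(snr, gamma)

# 500 diffusion steps and gamma = 5 follow Table 5.
alpha_bars = cumulative_alphas(linear_beta_schedule(500))
weights = [min_snr_weight(a) for a in alpha_bars]
```

Under this schedule the weight is clipped to γ at low-noise steps, where the SNR is very large, and decays toward zero at high-noise steps, so no single noise level dominates the loss.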

Table 5: Hyperparameters for Diffusion model training
Hyperparameter Value
Learning rate 1e-4
Batch size 128
Epochs 450
Diffusion steps 500
Drop probability 0.1
Min-SNR (γ) 5
Optimizer Adam

A.3 Q-learning

In our approach, we utilize both a Q-network and a Value network. The Q-network follows the DDQN method, employing two networks that learn slowly according to the update ratio. The Value network uses a single network. Both the Q-network and the Value network are structured with repeated MLP layers. The Q-network encodes the state into a 256-dimensional vector and the latent vector into a 128-dimensional vector. These two vectors are concatenated and passed through additional MLP layers to compute the final Q-value. The Value network only encodes the state into a 256-dimensional vector, which is then used to compute the value. Between the linear layers, GELU activation functions and LayerNorm are applied. In this way, both the Q-network and Value network are implicitly trained under the guidance of the diffusion model.

Table 6: Hyperparameters for Q-learning
Hyperparameter Value
Learning rate 5e-4
Batch size 128
Discount factor (γ) 0.995
Target network update rate 0.995
PER buffer α 0.7
PER buffer β 0.3 → 1
Number of latent samples 500
Expectile (τ) 0.9
Extra steps 5
Scheduler StepLR
Scheduler step 50
Scheduler γ 0.3
Optimizer Adam
Optimizer Adam
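Two of the hyperparameters in Table 6 can be made concrete with a small sketch. The first function is the expectile loss of Implicit Q-Learning (Kostrikov et al., 2022) with τ = 0.9; the second is a Polyak-style target update. Both are illustrative assumptions about how these hyperparameters enter training: the exact DIAR losses are defined in the main text, and reading the update rate 0.995 as the fraction of the target network that is retained is an interpretation, not something stated here. The function names are hypothetical.

```python
def expectile_loss(q_value, v_value, tau=0.9):
    """Asymmetric squared error |tau - 1(u < 0)| * u^2 with u = q - v."""
    diff = q_value - v_value
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff

def soft_update(target_params, online_params, rate=0.995):
    """Polyak averaging (assumed form): target <- rate * target + (1 - rate) * online."""
    return [rate * t + (1.0 - rate) * o
            for t, o in zip(target_params, online_params)]

# Underestimating Q (q > v) is penalized nine times more than overestimating it.
loss_under = expectile_loss(1.0, 0.0)   # weight tau = 0.9
loss_over = expectile_loss(0.0, 1.0)    # weight 1 - tau = 0.1
```

With τ = 0.9 the value estimate is pushed toward the upper range of the Q-values seen for a state, which is how IQL approximates a maximum without querying out-of-distribution actions.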

Appendix B DIAR Policy Execution Details

We provide a detailed explanation of how DIAR performs policy execution. It primarily selects the latent with the highest Q-value. However, if the current state value $V(\bm{s}_{t+h})$ is higher than the future state value $V(\bm{s}_{t+H})$, it triggers another search for a new latent. DIAR repeats this process until it either reaches the goal or reaches the maximum step $T$.

Input: environment Env, Q-network $Q_\eta(\bm{s},\bm{z})$, value network $V_\phi(\bm{s})$, policy decoder $\pi_{\theta_D}(\bm{a}|\bm{s},\bm{z})$, state decoder $f_\theta(\bm{s}_{t+H}|\bm{s}_t,\bm{z}_t)$, diffusion model $\mu_\psi(\bm{z}|\bm{s})$, horizon $H$, max step $T$, number of sampled latent vectors $n$, latent vector $\bm{z}$
$t \leftarrow 0$
$done \leftarrow \textit{False}$
while not $done$ do
    $\bm{s}_t \leftarrow \textit{Env}$
    $\bm{z}^0_t, \bm{z}^1_t, \ldots, \bm{z}^{n-1}_t \leftarrow \mu_\psi(\bm{z}|\bm{s})$  # Sample latent vectors from the diffusion model
    $Q(\bm{s}_t,\bm{z}^0_t), Q(\bm{s}_t,\bm{z}^1_t), \ldots, Q(\bm{s}_t,\bm{z}^{n-1}_t) \leftarrow Q_\eta(\bm{s},\bm{z})$  # Calculate Q-values
    $\bm{z}^i_t \leftarrow \operatorname*{arg\,max}_{\bm{z}^i_t} Q(\bm{s}_t,\bm{z}^i_t),\ \bm{z}^i \in \{\bm{z}^0_t, \bm{z}^1_t, \ldots, \bm{z}^{n-1}_t\}$
    $\bm{s}_{t+H} \leftarrow f_\theta(\bm{s}_{t+H}|\bm{s}_t,\bm{z}^i_t)$  # Predict future state
    $V(\bm{s}_{t+H}) \leftarrow V_\phi(\bm{s})$  # Calculate value of future state
    $h \leftarrow 0$
    for $h < H$ do
        $\bm{s}_{t+h} \leftarrow \textit{Env}$
        $V(\bm{s}_{t+h}) \leftarrow V_\phi(\bm{s})$  # Calculate value of current state
        if $V(\bm{s}_{t+H}) < V(\bm{s}_{t+h})$ then
            break
        else
            $\bm{a}_{t+h} \leftarrow \pi_{\theta_D}(\bm{a}_{t+h}|\bm{s}_{t+h},\bm{z}^i_t)$
            Execute action $\bm{a}_{t+h}$
            Update $done$ by Env
            $h \leftarrow h+1$
        end if
    end for
    $t \leftarrow t+h$
end while
Algorithm 2 DIAR Policy Execution
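The control flow of Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: every callable is a hypothetical stand-in for a trained DIAR component, and the `max_replans` guard is an added assumption that keeps the loop finite if every sampled latent is rejected at $h = 0$. The toy 1-D environment uses a deliberately biased state decoder so that adaptive revaluation is actually triggered.

```python
def diar_rollout(env_reset, env_step, sample_latents, q_value, value,
                 decode_action, predict_future_state,
                 horizon, max_steps, max_replans):
    """Run one episode with Q-based latent selection and adaptive revaluation."""
    state = env_reset()
    t, replans, done = 0, 0, False
    while not done and t < max_steps and replans < max_replans:
        # Sample candidate latents and keep the one with the highest Q-value.
        candidates = sample_latents(state)
        latent = max(candidates, key=lambda z: q_value(state, z))
        replans += 1
        # Value of the state the chosen latent is predicted to reach.
        future_value = value(predict_future_state(state, latent))
        h = 0
        while h < horizon and not done and t + h < max_steps:
            # Adaptive revaluation: drop the latent once the current state
            # is valued above the predicted future state.
            if future_value < value(state):
                break
            state, done = env_step(state, decode_action(state, latent))
            h += 1
        t += h
    return state, t, replans

# Toy 1-D task (all of this is a hypothetical stand-in for the real models).
GOAL, H = 10.0, 4

def toy_value(s):
    return -abs(GOAL - s)

def toy_predict(s, z):
    return s + 0.5 * z          # biased decoder: predicts half the true displacement

def toy_q(s, z):
    return toy_value(toy_predict(s, z))

def toy_step(s, a):
    new_s = s + a
    return new_s, abs(new_s - GOAL) < 1e-9

final_state, steps, replans = diar_rollout(
    env_reset=lambda: 0.0,
    env_step=toy_step,
    sample_latents=lambda s: [-4.0, -2.0, 2.0, 4.0],
    q_value=toy_q,
    value=toy_value,
    decode_action=lambda s, z: z / H,   # latent = intended displacement over H steps
    predict_future_state=toy_predict,
    horizon=H, max_steps=100, max_replans=50,
)
```

In this toy run the agent abandons its first three latents one step early, because the environment state becomes more valuable than the decoder's pessimistic prediction, and then reaches the goal with a shorter final skill.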

Appendix C Training Process for β-VAE

This section details the process by which the β-VAE is trained. The β-VAE consists of four models: the skill latent encoder, policy decoder, state decoder, and state prior. These four components are trained simultaneously. Additionally, a diffusion prior is trained alongside to guide the β-VAE in generating appropriate latent vectors. The detailed process can be found in Algorithm 3.

Input: Dataset $\mathcal{D}$, state $s_t$, action $a_t$, epoch $M$, horizon $H$, diffusion steps $T$, Min-SNR $\gamma$, state prior $p_{\theta_s}(\bm{z}_t|\bm{s}_t)$, latent encoder $q_{\theta_E}(\bm{z}_t|\bm{s}_{t:t+H},\bm{a}_{t:t+H})$, policy decoder $\pi_{\theta_D}(\bm{a}_{t+i}|\bm{s}_{t+i},\bm{z}_t)$, state decoder $f_\theta(\bm{s}_{t+H}|\bm{s}_t,\bm{z}_t)$, β-VAE parameter $\theta$, diffusion prior $\mu_\psi$, KL regularization coefficient $\beta$
$iter \leftarrow 0$
for $iter < M$ do
    $\bm{s}_{t:t+H},\bm{a}_{t:t+H} \leftarrow \mathcal{D}$
    $\bm{z}_t \leftarrow q_{\theta_E}(\bm{z}_t|\bm{s}_{t:t+H},\bm{a}_{t:t+H})$  # Encoding latent vector
    $\mathcal{L}_1 \leftarrow -\sum_{i=0}^{H-1}\log\pi_{\theta_D}(\bm{a}_{t+i}|\bm{s}_{t+i},\bm{z}_t)$  # Reconstruction loss
    $\mathcal{L}_2 \leftarrow D_{KL}(q_{\theta_E}(\bm{z}_t|\bm{s}_{t:t+H},\bm{a}_{t:t+H}) \parallel p_{\theta_s}(\bm{z}_t|\bm{s}_t))$  # KL divergence with state prior
    $\mathcal{L}_3 \leftarrow -\log f_\theta(\bm{s}_{t+H}|\bm{s}_t,\bm{z}_t)$  # State decoder loss
    Noise latents $\bm{z}_j$ from Gaussian noise, $j \sim \mathcal{U}[1,T]$
    $\mathcal{L}_4 \leftarrow \min\{\mathrm{SNR}(j),\gamma\}(\|\bm{z}_t-\mu_\psi(\bm{z}_j,\bm{s}_t,j)\|^2)$  # Diffusion prior loss
    $\mathcal{L}_{total} \leftarrow \mathcal{L}_1+\beta\mathcal{L}_2+\mathcal{L}_3+\mathcal{L}_4$
    Update $\theta$ to minimize $\mathcal{L}_{total}$
    $iter \leftarrow iter+1$
end for
Algorithm 3 Training Beta Variational Autoencoder
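To make the KL term and the loss combination of Algorithm 3 concrete, the sketch below evaluates the closed-form KL divergence between two diagonal Gaussians and then combines the four loss terms with β = 0.1 (Table 4). Treating both the encoder posterior and the state prior as diagonal Gaussians is an assumption made for illustration; the function names and numeric inputs are hypothetical.

```python
import math

def kl_diag_gaussians(mean_q, std_q, mean_p, std_p):
    """KL( N(mean_q, diag(std_q^2)) || N(mean_p, diag(std_p^2)) ), summed over dims."""
    total = 0.0
    for mq, sq, mp, sp in zip(mean_q, std_q, mean_p, std_p):
        total += (math.log(sp / sq)
                  + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp)
                  - 0.5)
    return total

def vae_total_loss(recon, kl, state_dec, diff_prior, beta=0.1):
    """L_total = L_1 + beta * L_2 + L_3 + L_4, as in Algorithm 3."""
    return recon + beta * kl + state_dec + diff_prior

# Posterior shifted by one unit from a unit-variance prior: KL = 0.5 per dimension.
kl = kl_diag_gaussians([1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
total = vae_total_loss(recon=1.0, kl=kl, state_dec=0.5, diff_prior=0.25)
```

Because β scales only the KL term, a small value such as 0.1 lets the reconstruction and state-decoder terms dominate while still keeping the posterior close to the state prior.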

Appendix D Training Process for Latent Diffusion Model

This section details how the latent diffusion model is trained. The goal of the latent diffusion model is to learn the distribution of latent vectors generated by the β-VAE. The latent diffusion model is trained by first converting the offline dataset into latent vectors using the encoder of the β-VAE, and then learning from these latent vectors. The detailed process can be found in Algorithm 4.

Input: Dataset $\mathcal{D}$, state $s_t$, action $a_t$, epoch $M$, horizon $H$, diffusion steps $T$, Min-SNR $\gamma$, latent encoder $q_{\theta_E}(\bm{z}|\bm{s},\bm{a})$, diffusion model $\mu_\psi$, variance schedule $\alpha_1,\dots,\alpha_T$, $\bar{\alpha}_1,\dots,\bar{\alpha}_T$, $\beta_1,\dots,\beta_T$
$iter \leftarrow 0$
for $iter < M$ do
    $\bm{s}_{t:t+H},\bm{a}_{t:t+H} \leftarrow \mathcal{D}$
    $\bm{z}_t \leftarrow q_{\theta_E}(\bm{z}_t|\bm{s}_{t:t+H},\bm{a}_{t:t+H})$  # Encoding latent vector
    Sample diffusion time $j \sim \mathcal{U}[1,T]$
    Noise latents from Gaussian noise $\bm{z}_j \sim \mathcal{N}(\sqrt{\bar{\alpha}_j}\bm{z}_t, (1-\bar{\alpha}_j)\mathbf{I})$
    $\mathcal{L} \leftarrow \min\{\mathrm{SNR}(j),\gamma\}(\|\bm{z}_t-\mu_\psi(\bm{z}_j,\bm{s}_t,j)\|^2)$  # Diffusion model loss
    Update $\psi$ to minimize $\mathcal{L}$
    $iter \leftarrow iter+1$
end for
Algorithm 4 Training Latent Diffusion Model
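The noising step of Algorithm 4 can be sketched with the standard library alone. The function below draws $\bm{z}_j \sim \mathcal{N}(\sqrt{\bar{\alpha}_j}\bm{z}_t, (1-\bar{\alpha}_j)\mathbf{I})$ for a given cumulative coefficient $\bar{\alpha}_j$. The function name `noise_latent` and the example values are hypothetical; a real implementation would operate on batched tensors.

```python
import math
import random

def noise_latent(z0, alpha_bar_j, rng):
    """Draw z_j ~ N(sqrt(alpha_bar_j) * z0, (1 - alpha_bar_j) * I)."""
    scale = math.sqrt(alpha_bar_j)
    sigma = math.sqrt(1.0 - alpha_bar_j)
    return [scale * z + sigma * rng.gauss(0.0, 1.0) for z in z0]

rng = random.Random(0)
clean = noise_latent([1.0, -2.0], 1.0, rng)   # alpha_bar = 1: no noise is added

# Empirical check of the mean and variance at alpha_bar = 0.25.
samples = [noise_latent([2.0], 0.25, rng)[0] for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

As $\bar{\alpha}_j$ falls from 1 toward 0, the sample moves from the clean latent toward pure Gaussian noise, which is the range of inputs the diffusion model must learn to denoise.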

Appendix E Diffusion Probabilistic Models

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) function as latent variable generative models, formally expressed through the equation $p_\theta(x_0) := \int p_\theta(x_{0:T})\,dx_{1:T}$. Here, $x_1,\ldots,x_T$ denote the sequence of latent variables, integral to the model’s capacity to assimilate and recreate the intricate distributions characteristic of high-dimensional data types like images and audio. In these models, the forward process $q(x_t|x_{t-1})$ methodically introduces Gaussian noise into the data, adhering to a predetermined variance schedule delineated by $\beta_1,\ldots,\beta_T$. This step-by-step addition of noise outlines the approximate posterior $q(x_{1:T}|x_0)$ within a structured mathematical formulation, which is specified as follows:

$$q(x_{1:T}|x_0) := \prod_{t=1}^{T} q(x_t|x_{t-1}), \quad q(x_t|x_{t-1}) := \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I) \tag{11}$$
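A standard consequence of this forward process (Ho et al., 2020) is that $x_t$ can be sampled directly from $x_0$ without iterating through the intermediate steps. The block below states this closed form, together with the signal-to-noise ratio that the Min-SNR weighting in Algorithms 3 and 4 relies on; the notation $\alpha_t$, $\bar{\alpha}_t$ matches the variance schedule listed in Algorithm 4.

```latex
\alpha_t := 1 - \beta_t, \qquad \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s, \qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1 - \bar{\alpha}_t) I\right)

x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I),
\qquad \mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}
```

This is the same Gaussian used to noise the latents in Algorithm 4, with the latent $\bm{z}_t$ in the role of $x_0$ and the diffusion time $j$ in the role of $t$.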

The iterative denoising process, also known as the reverse process, enables sample generation from Gaussian noised data, denoted as $p(x_T) = \mathcal{N}(x_T; 0, I)$. This process is modeled using a Markov chain, where each step involves generating the sample of the subsequent stage from the sample of the previous stage based on conditional probabilities. The joint distribution of the model, $p_\theta(x_{0:T})$, can be represented as follows:

$$p_\theta(x_{0:T}) := p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}|x_t), \quad p_\theta(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t,t), \Sigma_\theta(x_t,t)) \tag{12}$$

In the Diffusion Probabilistic Model, training is conducted via a reverse process that reconstructs the original data from noise. This framework gives the diffusion model considerable flexibility and strong performance. Recent studies have further demonstrated that applying the diffusion process within a latent space created by an autoencoder enhances fidelity and diversity in tasks such as image inpainting and class-conditional image synthesis (Rombach et al., 2022). This underscores the effectiveness of latent space methods in refining the capabilities of diffusion models for complex generative tasks. In light of this, applying conditions and guidance in the latent space enables diffusion models to function effectively and to exhibit strong generalization capabilities.

Appendix F Qualitative Demonstration through Maze2D Results

Following the main section, we report more results in the Maze2D environments. We qualitatively demonstrate that DIAR consistently generates favorable trajectories.

[Figure 7 panels, ten trajectory plots each: (a) Maze2D-umaze, (b) Maze2D-medium, (c) Maze2D-large]
Figure 7: DIAR-generated trajectories in diverse Maze2D demonstration. DIAR reliably reaches the goal even from starting points (blue) that are far from the goal (red). It even exhibits significant advantages in cases where decisions involve longer horizons.