License: CC BY-NC-ND 4.0
arXiv:2302.00284v2 [cs.LG] 12 Feb 2024

Selective Uncertainty Propagation in Offline RL

Sanath Kumar Krishnamurthy (Stanford University), Shrey Modi (Indian Institute of Technology Bombay), Tanmay Gangwani (Amazon), Sumeet Katariya (Amazon), Branislav Kveton (Amazon), Anshuka Rangi (Amazon)
Abstract

We consider the finite-horizon offline reinforcement learning (RL) setting, and are motivated by the challenge of learning the policy at any step $h$ in dynamic programming (DP) algorithms. To learn this, it is sufficient to evaluate the treatment effect of deviating from the behavioral policy at step $h$ after having optimized the policy for all future steps. Since the policy at any step can affect next-state distributions, the related distributional shift challenges can make this problem statistically far harder than estimating such treatment effects in the stochastic contextual bandit setting. However, the hardness of many real-world RL instances lies between the two regimes. We develop a flexible and general method called selective uncertainty propagation for confidence interval construction that adapts to the hardness of the associated distribution shift challenges. We show the benefits of our approach on toy environments and demonstrate the benefits of these techniques for offline policy learning.

1 Introduction

We study the finite-horizon offline reinforcement learning (RL) problem, focusing on algorithms that adapt to instance hardness. At a high level, we study algorithms that provide better guarantees for contextual bandit (CB)-like instances while still being able to plan in more dynamic RL-like instances.

Our work is motivated by real-world RL problems, such as user interaction with an e-commerce search engine (recommendation system). Here, the state can be a user query, and the action is the product recommendation from the engine. When the user wants to buy a particular product, the user often only enters a single product query unrelated to the previous one; thus resembling a sequence of CB problems. On the other hand, when the user explores products, the exploration queries are related through the user’s intent, and the recommendation system may want to steer the user toward the ideal product. Hence, this resembles the RL setting. This indicates the need to develop unified solutions that integrate CB and RL techniques – adapting to instance hardness. We now introduce the CB and RL frameworks in more detail.

Stochastic contextual bandits (CBs) [Langford and Zhang, 2008, Li et al., 2010] and finite-horizon reinforcement learning (RL) [Sutton, 1988, Williams, 1992, Sutton and Barto, 1998] are two fundamental frameworks for decision-making under uncertainty. In stochastic CBs, the environment samples the context and corresponding rewards (for each action) from a fixed but unknown distribution; the agent then observes the context and learns to select the most rewarding action conditioned on the context.

Finite-horizon RL is a generalization of CBs where contexts become states and a sequence of decisions is to be made over $H$ steps. Similar to the CB problem, at each step, the agent observes the current state, selects an action conditioned on the current state, and receives a reward sampled by the environment from a corresponding conditional distribution. However, unlike the CB problem, while the initial state is sampled from a fixed but unknown distribution, the next state at any step depends on the current state and the agent’s action. Hence, the agent can plan to attain high cumulative reward by learning to reach high-value future states.

Unfortunately, the fact that actions can influence future states implies that the agent needs to learn under state-distribution shifts, making the RL setting statistically much harder than CBs in the worst case. For example, [Foster et al., 2021] show that the worst-case sample complexity to learn a non-trivial offline RL policy is either polynomial in the state space size or exponential in other parameters. (Footnote 1: [Foster et al., 2021] consider the discounted infinite-horizon offline RL formulation. However, one should expect similar lower bounds for the finite-horizon offline RL formulation.) On the other hand, if actions do not influence next-state distributions at any step, the RL instance would be equivalent to solving $H$ stochastic CB instances. On such instances, offline bandit algorithms [Foster and Syrgkanis, 2019] would enjoy a polynomial sample complexity for policy learning with no dependence on state space size. Hence, for such instances, state-of-the-art offline RL algorithms such as pessimistic value function optimization [Jin et al., 2021] may be unnecessarily conservative.

We formalize this dichotomy and show that the statistical hardness of offline RL instances can be captured by the size of actions’ impact on the next state’s distribution. To show this, we consider the high-level structure of dynamic programming (DP) algorithms for offline RL [e.g. Jin et al., 2021]. DP algorithms construct a policy iteratively, starting from the policy for the final step and ending with the policy for the first step. At any step $h$, DP algorithms can be viewed as selecting the policy at step $h$ that maximizes the treatment effect of deviating from the behavioral policy at step $h$ after having optimized the policy for all future steps. The goal of this paper is to estimate and construct good confidence intervals for this treatment effect at step $h$.

Our primary focus is on confidence interval (CI) construction, which is motivated by the fact that many successful offline RL algorithms learn a policy that maximizes the lower bound of constructed CIs [Jin et al., 2021]. To account for estimation errors from future steps, standard methods for CI construction at any step propagate uncertainty from future steps to the current step $h$. This paper seeks to construct better CIs that adapt to instance hardness by selectively propagating uncertainty. In cases where all actions have zero estimated impact on next-state distributions, our procedure does not propagate any uncertainty from later steps and still constructs valid CIs for the treatment effect of deviating from the behavioral policy at step $h$ after having optimized the policy for all future steps. It treats the instance like a CB problem, hence enjoying a polynomial sample complexity with no dependence on state space size for treatment effect estimation. For more dynamic instances, our procedure must unavoidably propagate more uncertainty from future steps in order to continue constructing valid CIs. In this way, we adapt to the hardness of the instance for CI construction at any step. We also show the benefits of this approach for offline policy learning by proposing an algorithm that optimizes our constructed CIs. Simple simulations further support our claim.

Related Work: Both bandits and RL have been studied extensively [Lattimore and Szepesvari, 2019, Sutton and Barto, 1998, Foster and Rakhlin, 2023]. In bandits, the focus has been on achieving higher statistical efficiency by using the reward distribution of actions [Garivier and Cappe, 2011], prior distribution of model parameters [Thompson, 1933, Agrawal and Goyal, 2012, Chapelle and Li, 2012, Russo et al., 2018], parametric structure [Dani et al., 2008, Abbasi-Yadkori et al., 2011, Agrawal and Goyal, 2013], or agnostic methods [Agarwal et al., 2014]. In RL, the focus has been on different means of learning to plan for longer horizons, such as the value function [Sutton, 1988], policy [Williams, 1992], or their combination [Sutton et al., 2000]. Just as in our work, causal inference insights have helped improve the statistical efficiency of both CB and RL algorithms [Krishnamurthy et al., 2018, Carranza et al., 2023, Syrgkanis and Zhan, 2023]. However, bridging the gap between bandits and RL is an exciting and relatively under-explored research direction. One way to define this gap is to argue that in bandit-like environments, the state never changes once initially sampled. These bandit-like environments can be viewed as a special case of the situation where actions do not impact next-state distributions. With bridging this gap as one motivation, [Zanette and Brunskill, 2019, Yin and Wang, 2021] have used variance-dependent Bernstein bounds to limit uncertainty propagation when there is a lack of next-step value function heterogeneity. Another approach is to define this gap in a binary fashion: either actions have no impact on next-state distributions, or we are in a dynamic MDP environment. In an online setting, [Zhang et al., 2022] develop hypothesis tests to differentiate between the two situations and then select the most appropriate exploration algorithm. While their higher-level framing is similar to ours and their approach is novel, their approach cannot outperform existing RL algorithms in MDP environments. By interpolating between the two regimes, we hope to outperform bandit and existing RL algorithms that either forgo planning or are too conservative in accounting for actions’ impact on next-state distributions.

2 Preliminaries

Setting: We consider an episodic Markov Decision Process (MDP) setting with state space $\mathcal{X}$, action space $\mathcal{A}$, horizon $H$, and transition kernel $P = (P^{(h)})_{h=1}^{H}$. At every episode, the environment samples a starting state $x^{(1)}$ and a set of realized rewards $r = (r^{(h)})_{h=1}^{H}$ from a fixed but unknown distribution $D$. Here $r^{(h)}$ is a map from $\mathcal{X} \times \mathcal{A}$ to $[0,1]$. For any states $x, x' \in \mathcal{X}$ and action $a \in \mathcal{A}$, $P^{(h)}(x'|x,a)$ denotes the probability density of transitioning to state $x'$ conditional on taking action $a$ at state $x$ during step $h$. A trajectory $\tau$ is a sequence of states, actions, and rewards. That is, any trajectory $\tau$ is given by $\tau = (x^{(h)}, a^{(h)}, r^{(h)}(x^{(h)}, a^{(h)}))_{h=1}^{H}$.

A policy $\pi$ is a sequence of $H$ action sampling kernels $\{\pi^{(h)}\}_{h=1}^{H}$, where $\pi^{(h)}(a|x)$ denotes the probability of sampling action $a$ at state $x$ during step $h$ under the policy $\pi$. We let $D(\pi)$ denote the induced distribution over trajectories under the policy $\pi$. For any policy $\pi$, we define the (state-) value function $V^{(h)}_{\pi} : \mathcal{X} \rightarrow [0, H-h+1]$ at each step $h \in [H]$ such that

$$V^{(h)}_{\pi}(x) = \mathbb{E}_{D(\pi)}\bigg[\sum_{i=h}^{H} r^{(i)}(x^{(i)}, a^{(i)}) \,\bigg|\, x^{(h)} = x\bigg]. \tag{1}$$

The value of policy $\pi$ is given by $\mathbb{E}_{D}[V^{(1)}_{\pi}(x^{(1)})]$. We can take the expectation over $D$ instead of $D(\pi)$ here since the only random variable in $V^{(1)}_{\pi}(x^{(1)})$ is the initial state $x^{(1)}$, which does not depend on the choice of the policy $\pi$.

For any step $h \in [H]$, we let $R^{(h)}$ be a function from $\mathcal{X} \times \mathcal{A}$ to $[0,1]$ denoting the expected reward function for step $h$. That is, $R^{(h)}(x,a) = \mathbb{E}_{D}[r^{(h)}(x,a)]$. With some abuse of notation, for any $x, x' \in \mathcal{X}$, we let $R^{(h)}(x,\pi) = \sum_{a} \pi^{(h)}(a|x) R^{(h)}(x,a)$ and $P^{(h)}(x'|x,\pi) = \sum_{a} \pi^{(h)}(a|x) P^{(h)}(x'|x,a)$. That is, $R^{(h)}(x,\pi)$ is the expected reward at state $x$ and step $h$ under the policy $\pi$. Similarly, $P^{(h)}(x'|x,\pi)$ is the expected transition probability from $x$ to $x'$ at step $h$ under the policy $\pi$. For any step $h \in [H]$, we also let $V_{\text{max}}^{(h)}$ denote a bound on the maximum value $V^{(h)}_{\pi}(x)$ can take for any state $x$ and policy $\pi$.

Equivalently, the value functions $V^{(h)}_{\pi}$ can be defined via the backward recursion in (2), where $V^{(H+1)}_{\pi} \equiv 0$.

$$\forall h \in [H],\ x \in \mathcal{X}: \quad V^{(h)}_{\pi}(x) = R^{(h)}(x,\pi) + \int_{x'} V^{(h+1)}_{\pi}(x')\, P^{(h)}(x'|x,\pi). \tag{2}$$
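For intuition, the recursion in (2) can be sketched for a small tabular MDP, where the integral over next states becomes a sum. This is only an illustrative sketch: the array layout, the function name `evaluate_policy`, the 0-indexed steps, the randomly generated instance, and the uniform initial-state distribution are all assumptions made here, not part of the paper.

```python
import numpy as np

def evaluate_policy(R, P, pi):
    """Backward recursion (2) for a tabular finite-horizon MDP (hypothetical helper).

    R:  array (H, X, A), expected rewards R^(h)(x, a) in [0, 1]
    P:  array (H, X, A, X), transition probabilities P^(h)(x' | x, a)
    pi: array (H, X, A), action probabilities pi^(h)(a | x)
    Returns V of shape (H + 1, X); steps are 0-indexed and V[H] = 0.
    """
    H, X, A = R.shape
    V = np.zeros((H + 1, X))
    for h in reversed(range(H)):
        # Q^(h)(x, a) = R^(h)(x, a) + sum_x' P^(h)(x' | x, a) V^(h+1)(x')
        Q = R[h] + P[h] @ V[h + 1]
        # Average over actions drawn from pi^(h)(. | x)
        V[h] = np.sum(pi[h] * Q, axis=1)
    return V

# Random toy instance (assumed for illustration only).
rng = np.random.default_rng(0)
H, X, A = 3, 4, 2
R = rng.uniform(size=(H, X, A))
P = rng.dirichlet(np.ones(X), size=(H, X, A))
pi = rng.dirichlet(np.ones(A), size=(H, X))

V = evaluate_policy(R, P, pi)
# Policy value under an assumed uniform initial-state distribution.
policy_value = V[0].mean()
```

Because rewards lie in $[0,1]$, each row `V[h]` stays within the range $[0, H-h]$ in this 0-indexed layout, matching the range of the value function stated above.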

Data Collection Process: In this paper, we focus on the offline setting [Levine et al., 2020] with training data collected under a behavioral policy $\pi_b$. Apart from the policy $\pi_b$, the learner only has access to a dataset $S$ consisting of $T$ trajectories sampled from the distribution $D(\pi_b)$, where $D(\pi_b)$ is the data sampling distribution induced by $\pi_b$. That is, $S = \{\tau_t\}_{t=1}^{T}$, where $\tau_t = (x^{(h)}_t, a^{(h)}_t, r^{(h)}_t(x^{(h)}_t, a^{(h)}_t))_{h=1}^{H} \sim D(\pi_b)$. Since the transitions in these trajectories are induced by the behavioral policy, for notational convenience, we let $P_b = (P_b^{(h)})_{h=1}^{H}$ denote the transition kernel under the policy $\pi_b$. That is, for any $x, x' \in \mathcal{X}$, $P_b^{(h)}(x'|x) = P^{(h)}(x'|x,\pi_b)$.
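The data collection process can be illustrated with a short tabular simulation. The sketch below computes the behavioral kernel by averaging the transition kernel over the behavioral policy's actions, and draws a dataset of trajectories under that policy. The helper names, the array layout, and the use of Bernoulli rewards with mean $R^{(h)}(x,a)$ are assumptions made for illustration; the text only requires rewards in $[0,1]$.

```python
import numpy as np

def behavioral_kernel(P, pi_b):
    # P_b^(h)(x' | x) = sum_a pi_b^(h)(a | x) P^(h)(x' | x, a)
    return np.einsum("hxa,hxay->hxy", pi_b, P)

def sample_dataset(rng, mu, R, P, pi_b, T):
    """Draw T trajectories under pi_b (hypothetical helper).

    Rewards are Bernoulli with mean R[h, x, a]; this is an assumption.
    Returns integer arrays of states and actions and a float array of
    rewards, each of shape (T, H).
    """
    H, X, A = R.shape
    states = np.zeros((T, H), dtype=int)
    actions = np.zeros((T, H), dtype=int)
    rewards = np.zeros((T, H))
    for t in range(T):
        x = rng.choice(X, p=mu)                # initial state
        for h in range(H):
            a = rng.choice(A, p=pi_b[h, x])    # behavioral action
            states[t, h], actions[t, h] = x, a
            rewards[t, h] = rng.binomial(1, R[h, x, a])
            x = rng.choice(X, p=P[h, x, a])    # next state
    return states, actions, rewards

# Random toy instance (assumed for illustration only).
rng = np.random.default_rng(0)
H, X, A, T = 3, 4, 2, 200
mu = np.full(X, 1.0 / X)
R = rng.uniform(size=(H, X, A))
P = rng.dirichlet(np.ones(X), size=(H, X, A))
pi_b = rng.dirichlet(np.ones(A), size=(H, X))

P_b = behavioral_kernel(P, pi_b)
states, actions, rewards = sample_dataset(rng, mu, R, P, pi_b, T)
# Empirical average of total observed reward per trajectory.
behavior_value_estimate = rewards.sum(axis=1).mean()
```

The last line is the simple empirical-average estimate of the behavioral policy's value, which needs no correction for distribution shift because the data were collected under that same policy.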

2.1 Estimand of Interest

We now define our target estimand, the specific quantity we aim to estimate. Consider a fixed policy $\pi$ and suppose we would like to estimate its value. Since estimating the value of the behavioral policy $\pi_b$ is easy (the empirical average of total observed reward in each trajectory), we argue that it is sufficient to estimate $\mathbb{E}_{D}[V^{(1)}_{\pi}(x^{(1)}) - V^{(1)}_{\pi_b}(x^{(1)})]$, the difference in values between the evaluation and behavioral policies. This difference can be further decomposed. For each step $h$, let $\tilde{\pi}_h = (\pi_b^{(1)}, \dots, \pi_b^{(h-1)}, \pi^{(h)}, \dots, \pi^{(H)})$ be the policy that follows the behavioral policy up to step $h-1$ and then follows the evaluation policy. In (3), we decompose the difference in policy value between the evaluation and behavioral policies into the sum of differences in policy value between $\tilde{\pi}_h$ and $\tilde{\pi}_{h+1}$ for each step $h$.

$$\begin{aligned} \mathbb{E}_{D}[V^{(1)}_{\pi}(x^{(1)}) - V^{(1)}_{\pi_b}(x^{(1)})] &\stackrel{(i)}{=} \sum_{h=1}^{H} \mathbb{E}_{D}[V^{(1)}_{\tilde{\pi}_h}(x^{(1)}) - V^{(1)}_{\tilde{\pi}_{h+1}}(x^{(1)})] \\ &\stackrel{(ii)}{=} \sum_{h=1}^{H} \mathbb{E}_{D(\pi_b)}[V^{(h)}_{\tilde{\pi}_h}(x^{(h)}) - V^{(h)}_{\tilde{\pi}_{h+1}}(x^{(h)})]. \end{aligned} \tag{3}$$

Here (i) follows from telescoping and (ii) follows from the fact that the policies $\tilde{\pi}_h$ and $\tilde{\pi}_{h+1}$ agree with the behavioral policy for the first $h-1$ steps. We let $\alpha^{(h)}_{\pi}$, the term corresponding to step $h$ in the above decomposition, be our estimand of interest. That is, our estimand $\alpha^{(h)}_{\pi}$ is the difference in value of the policies $\tilde{\pi}_h$ and $\tilde{\pi}_{h+1}$. These policies only differ in the current step $h$, which may cause a difference in immediate rewards and may also cause a difference in next-state distributions (affecting future rewards even if the policies at future steps are the same).

$$\alpha^{(h)}_{\pi} = \mathbb{E}_{D(\pi_b)}[V^{(h)}_{\tilde{\pi}_h}(x^{(h)}) - V^{(h)}_{\tilde{\pi}_{h+1}}(x^{(h)})] \tag{4}$$
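The decomposition in (3) can be checked numerically on a small tabular MDP by computing each per-step estimand in (4) exactly from a known model and comparing their sum to the value gap. The sketch below is illustrative only: the helper names, the array layout, the 0-indexed steps, and the randomly generated instance are assumptions, and it uses the true model rather than offline data.

```python
import numpy as np

def evaluate_policy(R, P, pi):
    """Value functions via the recursion in (2); V[H] = 0 (steps are 0-indexed)."""
    H, X, A = R.shape
    V = np.zeros((H + 1, X))
    for h in reversed(range(H)):
        Q = R[h] + P[h] @ V[h + 1]           # shape (X, A)
        V[h] = np.sum(pi[h] * Q, axis=1)
    return V

def state_marginals(mu, P, pi):
    """d[h] is the distribution of the state at step h under policy pi."""
    H, X, A = pi.shape
    d = np.zeros((H, X))
    d[0] = mu
    for h in range(H - 1):
        P_pi = np.einsum("xa,xay->xy", pi[h], P[h])   # P^(h)(x' | x, pi)
        d[h + 1] = d[h] @ P_pi
    return d

def step_estimands(mu, R, P, pi, pi_b):
    """alpha[h] as in (4), computed exactly from the model."""
    H = R.shape[0]
    d_b = state_marginals(mu, P, pi_b)
    alpha = np.zeros(H)
    for h in range(H):
        # pi_tilde_h: behavioral policy before step h, evaluation policy from h on
        tilde_h = np.concatenate([pi_b[:h], pi[h:]])
        # pi_tilde_{h+1}: behavioral policy through step h, evaluation policy after
        tilde_next = np.concatenate([pi_b[:h + 1], pi[h + 1:]])
        V_h = evaluate_policy(R, P, tilde_h)[h]
        V_next = evaluate_policy(R, P, tilde_next)[h]
        alpha[h] = d_b[h] @ (V_h - V_next)
    return alpha

# Random toy instance (assumed for illustration only).
rng = np.random.default_rng(0)
H, X, A = 4, 5, 3
mu = np.full(X, 1.0 / X)
R = rng.uniform(size=(H, X, A))
P = rng.dirichlet(np.ones(X), size=(H, X, A))
pi = rng.dirichlet(np.ones(A), size=(H, X))
pi_b = rng.dirichlet(np.ones(A), size=(H, X))

alpha = step_estimands(mu, R, P, pi, pi_b)
gap = mu @ (evaluate_policy(R, P, pi)[0] - evaluate_policy(R, P, pi_b)[0])
decomposition_holds = np.isclose(alpha.sum(), gap)
```

Each `alpha[h]` averages over the behavioral state distribution at step `h`, which is what makes the per-step terms natural targets for offline data collected under the behavioral policy.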

We now justify $\alpha^{(h)}_{\pi}$ as an important estimand, starting with why it is a reasonable quantity to care about. Given the decomposition in (3), estimating and constructing CIs for $\{\alpha^{(h)}_{\pi}\}_{h=1}^{H}$ allows us to estimate and construct CIs for $\mathbb{E}_{D}[V^{(1)}_{\pi}(x^{(1)}) - V^{(1)}_{\pi_b}(x^{(1)})]$ (the difference in policy value between the evaluation and behavioral policies), and thus for $\mathbb{E}_{D}[V^{(1)}_{\pi}(x^{(1)})]$ (the evaluation policy value).

Beyond being an effective surrogate for policy evaluation, $\alpha^{(h)}_{\pi}$ is an important quantity to consider in dynamic programming (DP) algorithms. DP algorithms construct the policy for the final step ($\pi^{(H)}$) and then iteratively construct policies for earlier steps. At step $h$, the policies at steps $h+1$ to $H$ are already fixed. Hence at this step, one can interpret DP algorithms as attempting to select $\pi^{(h)}$ to maximize $\alpha^{(h)}_{\pi}$; that is, to maximize the treatment effect of deviating from the behavioral policy at step $h$ after having optimized the policy for all future steps. Hence, for any step $h$, $\alpha^{(h)}_{\pi}$ is a helpful estimand to consider for decision-making at step $h$.

Importantly for us, when actions at step $h$ do not affect next-state distributions, the problem of choosing a policy at step $h$ can be viewed as a CB problem. Helpfully, in this case, unlike policy value, $\alpha^{(h)}_{\pi}$ depends only on immediate rewards and can be estimated via offline stochastic CB techniques. However, when actions at step $h$ do influence next-state distributions, RL techniques are necessary for estimating $\alpha^{(h)}_{\pi}$. Hence, beyond being a critical quantity for decision-making at step $h$, it is also amenable to interpolating between CB and RL techniques. Thus, our paper focuses on estimating and constructing tight confidence intervals (CIs) for this estimand ($\alpha^{(h)}_{\pi}$).

3 Shift Model

Offline RL is more challenging than offline policy learning in the stochastic CB setting [Foster et al., 2021]. The primary reason for the difference between the two settings is the state-distribution shift induced by deviating from the behavioral policy. Distribution shift makes any statistical learning problem challenging [Vergara et al., 2012, Bobu et al., 2018, Farshchian et al., 2018]. Hence, methods that adapt to instance hardness must rely on some implicit or explicit approach to measure this state-distribution shift. To this end, we model the “heterogeneous treatment effect” [Künzel et al., 2019, Nie and Wager, 2021] of actions on the next-state distribution and refer to this effect as the “shift model”. More precisely, we define the shift model $\Delta = (\Delta^{(h)})_{h=1}^{H}$ in (5).

$$\forall (x,a),\quad \Delta^{(h)}(\cdot|x,a) = P^{(h)}(\cdot|x,a) - P_b^{(h)}(\cdot|x). \tag{5}$$

Here $\Delta^{(h)}(x'|x,a)$ captures the shift in the probability of transitioning from $x$ to $x'$ due to selecting action $a$ at state $x$ instead of following the behavioral policy. With some abuse of notation, for any $x, x' \in \mathcal{X}$, we let $\Delta^{(h)}(x'|x,\pi) = \sum_{a} \pi^{(h)}(a|x)\,\Delta^{(h)}(x'|x,a)$. That is, $\Delta^{(h)}(x'|x,\pi)$ is the expected shift (with respect to $P_b$) in the probability of transitioning from $x$ to $x'$ at step $h$ under the policy $\pi$. It is worth noting that shifts are bounded. For all $(x,a)$, since $\Delta^{(h)}(\cdot|x,a)$ is a difference of two state distributions, we have $\|\Delta^{(h)}(\cdot|x,a)\|_1 \leq 2$ by the triangle inequality.
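To make the definition concrete, the following is a minimal tabular sketch of (5) and of the policy-averaged shift. It assumes a finite state and action space and assumes that $P_b^{(h)}(\cdot|x) = \sum_a \pi_b(a|x) P^{(h)}(\cdot|x,a)$; the function names `shift_model` and `policy_shift` and the random instance are hypothetical and only serve as an illustration.

```python
import numpy as np

def shift_model(P, pi_b):
    """Return Delta[x, a, x'] = P[x, a, x'] - P_b[x, x'].

    P    : (S, A, S) array, P[x, a, x'] is the probability of x -> x' under action a.
    pi_b : (S, A) array, behavioral policy pi_b(a | x).
    Assumption of this sketch: P_b(. | x) = sum_a pi_b(a | x) P(. | x, a).
    """
    P_b = np.einsum("xa,xay->xy", pi_b, P)
    return P - P_b[:, None, :]

def policy_shift(Delta, pi):
    """Return Delta[x, x'] under pi, i.e. sum_a pi(a | x) Delta[x, a, x']."""
    return np.einsum("xa,xay->xy", pi, Delta)

# Hypothetical random instance with 4 states and 3 actions.
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel, shape (S, A, S)
pi_b = rng.dirichlet(np.ones(A), size=S)     # behavioral policy, shape (S, A)
pi = rng.dirichlet(np.ones(A), size=S)       # evaluation policy, shape (S, A)

Delta = shift_model(P, pi_b)
l1 = np.abs(Delta).sum(axis=-1)              # ||Delta(. | x, a)||_1 for each (x, a)
Delta_pi = policy_shift(Delta, pi)           # policy-averaged shift, shape (S, S)
```

Under this construction each row of `Delta` sums to zero, every entry of `l1` is at most 2, and the shift averaged under the behavioral policy itself vanishes.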

We argue that shift helps capture instance hardness for estimating $\alpha^{(h)}_{\pi}$. To see this, we provide a shift-dependent expression for $\alpha^{(h)}_{\pi}$.

$$\begin{aligned}
\alpha^{(h)}_{\pi} &\stackrel{(i)}{=} \mathbb{E}_{D(\pi_b)}\big[V^{(h)}_{\tilde{\pi}_h}(x^{(h)}) - V^{(h)}_{\tilde{\pi}_{h+1}}(x^{(h)})\big] \\
&\stackrel{(ii)}{=} \mathbb{E}_{D(\pi_b)}\big[R^{(h)}(x^{(h)},\pi) - R^{(h)}(x^{(h)},\pi_b)\big] \\
&\quad + \mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} V_{\pi}^{(h+1)}(x')\,\Delta^{(h)}(x'|x^{(h)},\pi)\bigg].
\end{aligned} \tag{6}$$

Here (i) follows from (4) (the definition of $\alpha^{(h)}_{\pi}$), and (ii) follows from (2) and (5). Note that, in the final expression of (6), the first term can be estimated using stochastic CB techniques, and the dependence on the next-step value function is scaled by the size of the shift. This hints at the possibility of developing methods that interpolate between CB and RL techniques. More formally, in Section 4, we show that shift estimates enable us to adapt to the hardness of our setting when estimating and constructing CIs for $\alpha^{(h)}_{\pi}$.
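The decomposition in (6) can be checked numerically on a small tabular instance. The sketch below is illustrative only: it assumes a one-step Bellman form for the two value functions (immediate reward at step $h$ plus expected next-step value $V_{\pi}^{(h+1)}$), which stands in for (2), and all arrays (`P`, `R`, `d_h`, `V_next`) are hypothetical random quantities rather than anything from our experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A = 5, 3
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[x, a, x'] at step h
R = rng.uniform(size=(S, A))                 # mean reward R(x, a) at step h
pi_b = rng.dirichlet(np.ones(A), size=S)     # behavioral policy at step h
pi = rng.dirichlet(np.ones(A), size=S)       # evaluation policy at step h
d_h = rng.dirichlet(np.ones(S))              # step-h state distribution under pi_b
V_next = rng.uniform(0.0, 3.0, size=S)       # stand-in for V_pi^{(h+1)}

def one_step_value(policy):
    """Value at step h when `policy` acts at step h and pi acts afterwards."""
    r = (policy * R).sum(axis=1)
    P_pol = np.einsum("xa,xay->xy", policy, P)
    return r + P_pol @ V_next

# First line of (6): difference of the two step-h value functions.
alpha_direct = d_h @ (one_step_value(pi) - one_step_value(pi_b))

# Final expression of (6): immediate-reward term plus shift-weighted next-step value.
Delta = P - np.einsum("xa,xay->xy", pi_b, P)[:, None, :]
Delta_pi = np.einsum("xa,xay->xy", pi, Delta)
reward_term = d_h @ ((pi - pi_b) * R).sum(axis=1)
shift_term = d_h @ (Delta_pi @ V_next)
alpha_decomposed = reward_term + shift_term
```

The two quantities agree up to floating-point error, and `shift_term` is exactly zero whenever the transition kernel does not depend on the action.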

4 Theory: Selective Propagation

In Section 2, we motivated and defined our estimand $\alpha^{(h)}_{\pi}$ (see (4)), which is the treatment effect of deviating from the behavioral policy at step $h$ after having already deviated from the behavioral policy for all future steps. We now present an approach to estimate and construct tight, valid CIs for $\alpha^{(h)}_{\pi}$, with interval size adapting to instance hardness. Here, harder instances have larger next-state distribution shifts when deviating from the behavioral policy. When shifts are smaller, we can rely more on statistically efficient CB methods. However, when shifts are larger (the instance is more dynamic), we unavoidably must rely more on RL methods that account for worst-case distribution shifts.

Our approach to estimating and constructing tight, valid CIs for $\alpha^{(h)}_{\pi}$ requires several inputs. These inputs, described in Section 4.1, allow us to abstract away existing approaches to well-studied estimation problems in CB and RL settings. In Section 4.2, we describe how to combine these existing tools to achieve guarantees that adapt to instance hardness.

4.1 Inputs

Our method interpolates between existing tools for CB and RL settings by leveraging shift estimates. To simplify our analysis and generalize our results, we assume access to these estimates as inputs to our interpolation method. In particular, we take as input: (1) an offline CB treatment effect estimate and corresponding CI, (2) optimistic and pessimistic offline RL value function estimates, and (3) shift estimates with average error bounds. As the quality of our inputs improves (potentially as better estimators are developed), the quality of our outputs will correspondingly improve.

We now formally describe these inputs, requiring all the associated high-probability bounds to hold simultaneously with probability at least $1-\delta_{\text{in}}$. We start by describing the first input, which is based on CB methods.

Input 1 (CB estimates): This input provides an estimate and CI for $\theta^{(h)}_{\pi}$ (formally defined in (7)), which is the average treatment effect on the immediate reward of deviating from the behavioral policy at step $h$.

$$\theta^{(h)}_{\pi} = \mathbb{E}_{D(\pi_b)}\Big[R^{(h)}(x^{(h)},\tilde{\pi}_h) - R^{(h)}(x^{(h)},\tilde{\pi}_{h+1})\Big] \tag{7}$$

Since $\theta^{(h)}_{\pi}$ depends only on the immediate reward, well-established offline CB techniques [e.g., Dudik et al., 2014] can be used to estimate and construct CIs for the difference (in terms of immediate rewards) between these policies. We let $\hat{\theta}^{(h)}_{\pi}$ be our input estimate and let $\kappa^{(h)}_{\pi,\theta}$ be the input CI radius. That is, the confidence interval is given by (8).

$$|\theta^{(h)}_{\pi} - \hat{\theta}^{(h)}_{\pi}| \leq \kappa^{(h)}_{\pi,\theta} \tag{8}$$
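As one possible way to produce this input, the sketch below uses a plain inverse-propensity-weighted (IPW) estimate with a Hoeffding radius. This is a deliberately simple stand-in and not the doubly robust estimator of Dudik et al. [2014]. It assumes known behavioral propensities, rewards in $[0, r_{\max}]$, and a known bound $w_{\max}$ on the importance ratio; the function name `ipw_theta` and the synthetic data are hypothetical.

```python
import numpy as np

def ipw_theta(rewards, pi_probs, pi_b_probs, r_max, w_max, delta):
    """IPW estimate of theta with a Hoeffding radius (illustrative only).

    rewards    : observed immediate rewards r_t in [0, r_max]
    pi_probs   : pi(a_t | x_t) for the logged actions
    pi_b_probs : pi_b(a_t | x_t) for the logged actions (assumed known)
    w_max      : assumed upper bound on pi / pi_b
    """
    w = pi_probs / pi_b_probs
    terms = (w - 1.0) * rewards              # unbiased for E[R(x, pi) - R(x, pi_b)]
    theta_hat = terms.mean()
    width = max(w_max, 1.0) * r_max          # each term lies in [-r_max, (w_max - 1) r_max]
    kappa = width * np.sqrt(np.log(2.0 / delta) / (2.0 * len(terms)))
    return theta_hat, kappa

# Hypothetical synthetic CB data: 3 contexts, 2 actions, Bernoulli rewards.
rng = np.random.default_rng(0)
n, S, A = 20000, 3, 2
mean_r = np.array([[0.2, 0.9], [0.5, 0.4], [0.7, 0.1]])
pi_b = np.full((S, A), 0.5)
pi = np.array([[0.2, 0.8], [0.8, 0.2], [0.8, 0.2]])
x = rng.integers(S, size=n)                            # contexts drawn uniformly
a = (rng.random(n) < pi_b[x, 1]).astype(int)           # a ~ pi_b(. | x)
r = (rng.random(n) < mean_r[x, a]).astype(float)       # Bernoulli reward

theta_hat, kappa = ipw_theta(r, pi[x, a], pi_b[x, a], r_max=1.0, w_max=1.6, delta=0.05)
theta_true = np.mean(((pi - pi_b) * mean_r).sum(axis=1))
```

Hoeffding radii of this kind are conservative; tighter variance-aware radii could be substituted for `kappa` without changing how the input is used downstream.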

When deviating from the behavioral policy at step $h$ has no impact on next-state distributions, the estimate and CI for $\theta^{(h)}_{\pi}$ can be used as the estimate and CI for $\alpha^{(h)}_{\pi}$. However, when there is an impact on next-state distributions, valid estimation and CI construction for $\alpha^{(h)}_{\pi}$ requires us to propagate estimates and uncertainty from future steps to the current step. To enable this propagation, we take estimates for $V_{\pi}^{(h+1)}$ as our second input.

Input 2 (RL estimates): This input provides pessimistic, standard, and optimistic estimates for $V_{\pi}^{(h+1)}$, denoted by $\hat{V}_{\pi,p}^{(h+1)}$, $\hat{V}_{\pi}^{(h+1)}$, and $\hat{V}_{\pi,o}^{(h+1)}$ respectively, such that the ordering in (9) holds. (Footnote 2: Note that (9) can be enforced during input construction.) Recall that $V_{\max}^{(h+1)}$ denotes a bound on the maximum value $V^{(h+1)}_{\pi}(x)$ can take for any state $x$. Further, with high probability, we require that (10) holds; that is, the true value function is bounded by the pessimistic and optimistic value function estimates.

$$\forall x,\; 0 \leq \hat{V}_{\pi,p}^{(h+1)}(x) \leq \hat{V}_{\pi}^{(h+1)}(x) \leq \hat{V}_{\pi,o}^{(h+1)}(x) \leq V^{(h+1)}_{\max} \tag{9}$$

$$\forall x,\; V_{\pi}^{(h+1)}(x) \in \big[\hat{V}_{\pi,p}^{(h+1)}(x),\, \hat{V}_{\pi,o}^{(h+1)}(x)\big] \tag{10}$$

There is a large and growing literature on value function estimation in RL, including optimistic and pessimistic value function estimators that are designed to satisfy (10) [e.g., Martin et al., 2017, Wang et al., 2019, Jin et al., 2021]. Thus, we can employ state-of-the-art methods to construct these next-step value function estimates.
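The ordering in (9) can be imposed by clipping. The following is a minimal sketch for tabular estimates stored as arrays; the function name `enforce_ordering` is hypothetical. Since the true value lies in $[0, V_{\max}^{(h+1)}]$, clipping bounds that already satisfy (10) keeps them valid. The `np.maximum` line is only a safeguard for inputs whose bounds cross, in which case (10) could not have held to begin with.

```python
import numpy as np

def enforce_ordering(V_p, V, V_o, V_max):
    """Clip (pessimistic, standard, optimistic) estimates so that (9) holds."""
    V_p = np.clip(V_p, 0.0, V_max)
    V_o = np.clip(V_o, 0.0, V_max)
    V_o = np.maximum(V_o, V_p)   # safeguard: only active if the input bounds cross
    V = np.clip(V, V_p, V_o)     # place the standard estimate between the bounds
    return V_p, V, V_o

# Hypothetical raw estimates over three states that violate (9).
V_p, V, V_o = enforce_ordering(
    np.array([-0.5, 1.0, 2.0]),
    np.array([0.3, 0.5, 5.0]),
    np.array([0.8, 3.0, 4.5]),
    V_max=4.0,
)
```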

Input 2 gave us estimates for $V_{\pi}^{(h+1)}$ (the next-step value), which we may need to propagate to the current step $h$ when constructing an estimate and CI for $\alpha^{(h)}_{\pi}$. Since our goal is to interpolate between tight CB guarantees and always-valid RL guarantees, unlike traditional RL algorithms, we want to be selective in propagating next-step estimates and uncertainties. Our final input, shift estimates, allows us to propagate these estimates only when required, enabling our adaptation to instance hardness.

Input 3 (Shift estimates): This input provides an estimate for $\Delta^{(h)}$ (see (5)) and an associated error bound, denoted by $\hat{\Delta}^{(h)}$ and $\kappa^{(h)}_{\pi,\Delta}$ respectively, such that (11) holds (recall from Section 3 that $\Delta^{(h)}$ satisfies the same bound). We also require that (12) holds with high probability.

$$\forall (x,a),\; \|\hat{\Delta}^{(h)}(\cdot|x,a)\|_1 \leq 2 \tag{11}$$

$$\mathbb{E}_{D(\pi_b)} \big\|\hat{\Delta}^{(h)}(\cdot|x^{(h)},\pi) - \Delta^{(h)}(\cdot|x^{(h)},\pi)\big\|_1 \leq \kappa^{(h)}_{\pi,\Delta} \tag{12}$$

Since the true shift model is a function of the true transition model (see Section 3), shift can be estimated by first estimating the transition model and then calculating the treatment effect (due to deviating from the behavioral policy) on transitioning between any pair of states [see Moerland et al., 2023, for a survey on model-based RL and transition model estimation]. (Footnote 3: As a treatment effect model, shift may also be estimated via heterogeneous treatment effect estimators [e.g., Nie and Wager, 2021, Künzel et al., 2019].)
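In the tabular case, the transition-model route can be sketched as a count-based plug-in estimator. The code below is an assumption-laden illustration: `estimate_shift` is a hypothetical name, unvisited state-action pairs fall back to the estimated behavioral transition (so their estimated shift is zero), and unvisited states fall back to a uniform distribution. Because the estimate is a difference of two distributions, (11) holds by construction; the error bound $\kappa^{(h)}_{\pi,\Delta}$ in (12) would have to be supplied separately.

```python
import numpy as np

def estimate_shift(x, a, x_next, S, A):
    """Plug-in estimate Delta_hat[x, a, x'] from logged step-h transitions."""
    counts = np.zeros((S, A, S))
    np.add.at(counts, (x, a, x_next), 1.0)
    n_xa = counts.sum(axis=2, keepdims=True)   # visits to (x, a), shape (S, A, 1)
    n_x = n_xa.sum(axis=1)                     # visits to x, shape (S, 1)
    # Behavioral next-state distribution; uniform fallback for unvisited states.
    P_b_hat = np.where(n_x > 0, counts.sum(axis=1) / np.maximum(n_x, 1.0), 1.0 / S)
    # Per-action transition estimate; fallback to P_b_hat for unvisited (x, a).
    P_hat = np.where(n_xa > 0, counts / np.maximum(n_xa, 1.0), P_b_hat[:, None, :])
    return P_hat - P_b_hat[:, None, :]

# Hypothetical log of four transitions from state 0 (state 1 is never visited).
x = np.array([0, 0, 0, 0])
a = np.array([0, 0, 1, 1])
x_next = np.array([0, 1, 1, 1])
Delta_hat = estimate_shift(x, a, x_next, S=2, A=2)
```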

4.2 Combining Inputs

We now have all our required input estimates and can state our main result (Theorem 4.1) on constructing an estimate and CI for $\alpha^{(h)}_{\pi}$ in a way that adapts to instance hardness.

Theorem 4.1.

Suppose we have: (1) CB inputs $(\hat{\theta}^{(h)}_{\pi}, \kappa^{(h)}_{\pi,\theta})$; (2) RL inputs $(\hat{V}_{\pi,p}^{(h+1)}, \hat{V}_{\pi}^{(h+1)}, \hat{V}_{\pi,o}^{(h+1)})$ satisfying (9); and (3) shift inputs $(\hat{\Delta}^{(h)}, \kappa^{(h)}_{\pi,\Delta})$ satisfying (11), such that (8), (10), and (12) hold with probability at least $1-\delta_{\text{in}}$.
Moreover, suppose we have a (holdout) dataset of $T$ trajectories $S = \{\tau_t\}_{t=1}^{T}$, sampled from the distribution $D(\pi_b)$, that were not used for constructing input estimates. (Footnote 4: Utilizing a holdout set for estimating and constructing CIs for $\alpha^{(h)}_{\pi}$ allows us to treat our input estimates as fixed quantities, independent of the randomness in the sampled holdout dataset.) Our estimate for $\alpha^{(h)}_{\pi}$ is then denoted by $\hat{\alpha}^{(h)}_{\pi}$ and given by (13).

$$\hat{\alpha}^{(h)}_{\pi} = \hat{\theta}^{(h)}_{\pi} + \frac{1}{T}\sum_{t=1}^{T}\int_{x'} \hat{V}_{\pi}^{(h+1)}(x')\,\hat{\Delta}^{(h)}(x'|x^{(h)}_t,\pi) \tag{13}$$

Now for some fixed $\delta > 0$, with probability at least $1-\delta-\delta_{\text{in}}$, the confidence interval in (14) holds.

$$|\alpha^{(h)}_{\pi} - \hat{\alpha}^{(h)}_{\pi}| \leq L^{(h)}_{\pi}. \tag{14}$$

Here $L^{(h)}_{\pi}$ is given by (15).

$$\begin{aligned}
L^{(h)}_{\pi} ={}& \kappa^{(h)}_{\pi,\theta} + V_{\max}^{(h+1)}\kappa^{(h)}_{\pi,\Delta} + 6V^{(h+1)}_{\max}\sqrt{\frac{\ln(4/\delta)}{2T}} \\
&+ \frac{1}{T}\sum_{t=1}^{T}\int_{x'} \big|\mathbb{E}_{D(\pi_b)}[\hat{\Delta}^{(h)}(x'|x^{(h)}_t,\pi)]\big|\,\Gamma_{\pi}^{(h+1)}(x')
\end{aligned} \tag{15}$$

Here $\Gamma_{\pi}^{(h+1)}$ is the difference between the optimistic and pessimistic estimates; it captures the uncertainty in the next-step value function estimates. That is, for all $x \in \mathcal{X}$, $\Gamma_{\pi}^{(h+1)}(x) = \hat{V}_{\pi,o}^{(h+1)}(x) - \hat{V}_{\pi,p}^{(h+1)}(x)$.
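For a finite state space, the estimate (13) and the width (15) reduce to a few array operations. The sketch below is a hedged illustration and not a certified implementation of the theorem: `selective_ci` is a hypothetical name, integrals over $x'$ become sums, and the inner expectation in the last term of (15) is replaced by the per-sample absolute shift $|\hat{\Delta}^{(h)}(x'|x^{(h)}_t,\pi)|$. By the triangle inequality, this substitution is at least as large as the absolute value of the holdout average, so it is the more conservative of the two plug-in choices.

```python
import numpy as np

def selective_ci(theta_hat, kappa_theta, kappa_delta, V_hat, V_p, V_o,
                 Delta_hat_pi, x_holdout, V_max, delta):
    """Tabular sketch of (13) and (15).

    Delta_hat_pi : (S, S) array, policy-averaged shift estimate Delta_hat(x' | x, pi)
    x_holdout    : length-T integer array of step-h states from the holdout set
    """
    T = len(x_holdout)
    shifts = Delta_hat_pi[x_holdout]                  # (T, S), one row per holdout state
    alpha_hat = theta_hat + (shifts @ V_hat).mean()   # equation (13)
    gamma = V_o - V_p                                 # next-step value uncertainty
    propagated = (np.abs(shifts) @ gamma).mean()      # per-sample plug-in for the last term
    hoeffding = 6.0 * V_max * np.sqrt(np.log(4.0 / delta) / (2.0 * T))
    L = kappa_theta + V_max * kappa_delta + hoeffding + propagated
    return alpha_hat, L

# Hypothetical two-state example: only state 0 has a nonzero estimated shift.
alpha_hat, L = selective_ci(
    theta_hat=0.2, kappa_theta=0.05, kappa_delta=0.01,
    V_hat=np.array([1.0, 3.0]), V_p=np.array([0.5, 2.0]), V_o=np.array([1.5, 4.0]),
    Delta_hat_pi=np.array([[0.1, -0.1], [0.0, 0.0]]),
    x_holdout=np.array([0, 1, 0, 1]), V_max=4.0, delta=0.05,
)
```

When the estimated shift is zero everywhere, `alpha_hat` equals `theta_hat` and the $\Gamma$ term drops out of `L`, which is the selective behavior the theorem describes.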

One of the advantages of in-distribution supervised learning is that excess risk bounds depend only on the complexity of the hypothesis class (and the number of training samples), with no dependence on the size of the feature space [see Shalev-Shwartz and Ben-David, 2014]. As discussed in Section 1, the statistical challenges of RL stem from the fact that learning under (state) distribution shifts is hard. For example, without additional assumptions, optimistic/pessimistic value function estimation has an unavoidable polynomial dependency on state-space size [Foster and Rakhlin, 2023]. Our goal is to avoid or minimize such dependencies when possible. The key benefit of Theorem 4.1 is that both our estimate ($\hat{\alpha}^{(h)}_{\pi}$) and our CI width ($L^{(h)}_{\pi}$) are "selective" in propagating/utilizing the RL estimates from input 2, allowing us to suffer from the slower worst-case RL estimation rates only on harder instances. To better understand this, let us dig deeper into the terms in our CI width ($L^{(h)}_{\pi}$).

Note that $\kappa^{(h)}_{\pi,\theta}$ and $\kappa^{(h)}_{\pi,\Delta}$ (see Inputs 1 and 3) bound errors averaged under the behavioral policy state-distribution; that is, they bound in-distribution average errors. Hence, with appropriate inputs, the first two terms in $L^{(h)}_{\pi}$ shrink quickly with no dependency on state-space size [Dudik et al., 2014, Shalev-Shwartz and Ben-David, 2014]. The third term in $L^{(h)}_{\pi}$, which enjoys a $\mathcal{O}(\sqrt{\log(1/\delta)/T})$ rate, also shrinks quickly to zero and has no dependency on state-space size.

We now only need to discuss the fourth and final term in $L^{(h)}_{\pi}$. Unlike the previous terms, which bound in-distribution average errors, this term does depend on per-state (point-wise) errors ($\Gamma_{\pi}^{(h+1)}(x')$). RL algorithms seek to bound per-state errors because these guarantees do not depend on the state distribution and are valid under any shift. We now illustrate how this robustness to state-distribution shift comes at the cost of larger error bounds, and argue that it is advantageous to scale these terms down with the estimated shift. First, as a sanity check, we show that this term is finite.

$$\int_{x'}|\hat{\Delta}^{(h)}(x'|x_{t}^{(h)},\pi)|\cdot\Gamma_{\pi}^{(h+1)}(x') \stackrel{(i)}{\leq} \|\hat{\Delta}^{(h)}(\cdot|x_{t}^{(h)},\pi)\|_{1}\,\|\Gamma_{\pi}^{(h+1)}\|_{\infty} \stackrel{(ii)}{\leq} 2\|\Gamma_{\pi}^{(h+1)}\|_{\infty}. \tag{16}$$

where (i) follows from Hölder’s inequality, and (ii) follows from (11). Now that we know this term is finite, we can argue that it shrinks to zero. Since $\Gamma_{\pi}^{(h+1)}(x')$ captures the size of per-state errors for pessimistic/optimistic value function estimates, we can expect this term to converge to zero in the limit of infinite data. The size of $\Gamma_{\pi}^{(h+1)}(x')$ must depend on how often states similar to $x'$ were visited at step $h+1$ in the training data for the RL input. For simplicity, consider the scenario where all states are visited uniformly at step $h+1$ under the distribution $D(\pi_b)$. In such a scenario, the frequency at which states similar to $x'$ were visited at step $h+1$ would depend on some measure of the size of the state space $\mathcal{X}$. This would imply that $\Gamma_{\pi}^{(h+1)}(x')$ shrinks at a rate that depends on some measure of the size of the state space $\mathcal{X}$. That is, while these terms shrink, they shrink slowly. Hence per-state bounds, while independent of the state distribution, come at the cost of slower statistical rates. As shown in [Foster et al., 2021], such a dependence of confidence interval width on state-space size is unavoidable in the worst case.

The key message of Theorem 4.1 is that we can move beyond this worst-case scenario by scaling these point-wise errors with the estimated shifts ($\hat{\Delta}^{(h)}$). For example, when $\hat{\Delta}^{(h)}\equiv 0$, the fourth term in $L_{\pi}^{(h)}$ is zero, allowing us to recover contextual bandit-style guarantees that are independent of state-space size. It is worth noting that, even when state-space size is not a concern, being selective about error/estimate propagation can improve the resulting interval widths.
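To make the fourth term concrete, the following is a minimal tabular sketch (not the authors' code) in which the integral over $x'$ becomes a finite sum. The vectors `p_action`, `p_behavior`, and `gamma_next` are hypothetical values chosen for illustration, and the shift is assumed to be a difference of two probability vectors, so its $\ell_1$ norm is at most 2. The sketch checks the two inequalities in (16) and shows that the term vanishes when the estimated shift is zero.

```python
import numpy as np

def selective_term(delta_hat, gamma_next):
    """Tabular form of the integral: sum over x' of |delta_hat[x']| * gamma_next[x']."""
    return float(np.sum(np.abs(delta_hat) * gamma_next))

rng = np.random.default_rng(0)
n_states = 6

# Hypothetical next-state distributions under the evaluated action and under pi_b.
p_action = rng.dirichlet(np.ones(n_states))
p_behavior = rng.dirichlet(np.ones(n_states))
delta_hat = p_action - p_behavior  # estimated shift; entries sum to zero

# Hypothetical per-state gaps between optimistic and pessimistic value estimates.
gamma_next = rng.uniform(0.0, 1.0, size=n_states)

term = selective_term(delta_hat, gamma_next)
holder_bound = float(np.abs(delta_hat).sum() * gamma_next.max())  # step (i) in (16)
worst_case = 2.0 * float(gamma_next.max())                        # step (ii) in (16)

# With no induced shift the term is zero, whatever gamma_next is.
no_shift_term = selective_term(np.zeros(n_states), gamma_next)
```

The gap between `term` and `worst_case` is the slack that selective propagation can exploit: the smaller the estimated shift, the less of the per-state uncertainty is carried backward.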

5 Selectively Pessimistic Value Iteration

Algorithm 1 Selectively Pessimistic Value Iteration
Inputs: Estimated reward model $\hat{R}$, estimated transition model $\hat{P}$, induced shift model $\hat{\Delta}$, and estimation uncertainty bonuses $b$.
Initialize $\hat{V}^{(H+1)},\hat{V}_{p}^{(H+1)},\hat{V}_{o}^{(H+1)}\equiv 0$.
for $h=H$ to $1$ do
     For all $(x,a)\in\mathcal{X}\times\mathcal{A}$, $\triangleright$ Estimate Q value.
$$\hat{Q}^{(h)}(x,a)\leftarrow\hat{R}^{(h)}(x,a)+\int_{x'}\hat{V}^{(h+1)}(x')\,\hat{P}^{(h)}(x'|x,a) \tag{17}$$
     For all $(x,a)\in\mathcal{X}\times\mathcal{A}$, $\triangleright$ Selective pessimism.
$$\begin{aligned}\hat{Q}_{\text{sp}}^{(h)}(x,a)&\leftarrow\hat{Q}^{(h)}(x,a)-b^{(h)}(x,a)\\&\quad-\int_{x'}|\hat{\Delta}^{(h)}(x'|x,a)|\cdot\big(\hat{V}_{o}^{(h+1)}(x')-\hat{V}_{p}^{(h+1)}(x')\big),\\\pi^{(h)}(x)&\leftarrow\operatorname*{arg\,max}_{a}\hat{Q}_{\text{sp}}^{(h)}(x,a)\end{aligned} \tag{18}$$
     For all $x\in\mathcal{X}$, $\triangleright$ Estimate pessimistic value.
$$\hat{V}_{p}^{(h)}(x)\leftarrow\max\Big(0,\;\hat{R}^{(h)}(x,\pi^{(h)}(x))+\int_{x'}\hat{V}_{p}^{(h+1)}(x')\,\hat{P}^{(h)}(x'|x,a)-b^{(h)}(x,a)\Big) \tag{19}$$
     For all $x\in\mathcal{X}$, $\triangleright$ Estimate optimistic value.
$$\hat{V}_{o}^{(h)}(x)\leftarrow\min\Big(V_{\max}^{(h)},\;\hat{R}^{(h)}(x,\pi^{(h)}(x))+\int_{x'}\hat{V}_{o}^{(h+1)}(x')\,\hat{P}^{(h)}(x'|x,a)+b^{(h)}(x,a)\Big) \tag{20}$$
     For all $x\in\mathcal{X}$, $\triangleright$ Estimate value.
$$\hat{V}^{(h)}(x)\leftarrow\min\Big(\hat{V}_{o}^{(h)}(x),\;\max\Big(\hat{V}_{p}^{(h)}(x),\;\hat{Q}^{(h)}(x,\pi^{(h)}(x))\Big)\Big) \tag{21}$$
end for

To demonstrate the benefits of our insights from Section 4, we now present an algorithm for offline policy learning. At a high level, we modify pessimistic value iteration (PVI) [Jin et al., 2021], a dynamic programming (DP) algorithm that chooses the policy at each step $h$ by maximizing pessimistic value function estimates. Pessimistic value maximization helps PVI avoid model exploitation; that is, PVI avoids picking policies with inaccurately high estimated values at step $h$ by penalizing value function estimate uncertainty. We argue that, depending on the instance, such penalties due to estimate uncertainty may be larger than necessary for selecting the policy at step $h$.

In particular, based on the results in Section 4, uncertainty from later steps does not always need to be fully propagated to estimate lower bounds for the effect of deviating from the behavioral policy at step $h$ (after fixing the policy for all future steps). Since we can view maximizing this treatment effect as the goal of DP algorithms at any step $h$, maximizing the tighter lower bounds from Section 4 should allow us to do better on easier CB-like instances (while always avoiding model exploitation), which motivates the design of Algorithm 1.

PVI often relies on a point-wise estimation uncertainty bonus $b=(b^{(h)})_{h=1}^{H}$ to account for reward and transition estimation errors (footnote 5). For any $(x,a,h)$, $b^{(h)}(x,a)$ bounds the total model estimation errors at $(x,a,h)$. At any $h$, PVI uses bonuses to account for estimation errors at step $h$ and to propagate estimation errors from future steps (by propagating pessimistic values from step $h+1$ that were computed using these bonuses). We similarly rely on these bonuses to account for model estimation errors.

Footnote 5: In finite-state MDPs, these bonuses can be count based, where $b^{(h)}(x,a)=\beta\cdot\sqrt{1/\max\{1,n_{h}(x,a)\}}$ and $n_{h}(x,a)$ is the number of times state $x$ and action $a$ are observed at step $h$ in the data set $S$. Here $\beta$ is an algorithmic parameter. Several papers have extended these ideas to continuous state spaces [Bellemare et al., 2016, Osband et al., 2021].
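The count-based bonus described for finite-state MDPs can be sketched in a few lines. This is only an illustration: the count table `counts_h` and the choice `beta=1.0` are hypothetical, and in practice $\beta$ would be tuned or set from a concentration argument.

```python
import numpy as np

def count_based_bonus(counts, beta):
    """Elementwise b(x, a) = beta * sqrt(1 / max(1, n(x, a))) for one step h."""
    counts = np.asarray(counts, dtype=float)
    return beta * np.sqrt(1.0 / np.maximum(1.0, counts))

# Hypothetical visit counts n_h(x, a) at one step: 3 states (rows), 2 actions (columns).
counts_h = np.array([[0, 4],
                     [25, 1],
                     [100, 9]])
bonus_h = count_based_bonus(counts_h, beta=1.0)
```

Unvisited pairs receive the largest bonus (here equal to `beta`), and the bonus decays at the usual inverse square-root rate as a state-action pair is observed more often.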

Our algorithm (Algorithm 1), selectively pessimistic value iteration (SPVI), takes as input: an estimated reward model $\hat{R}$, an estimated transition model $\hat{P}$, an induced shift model $\hat{\Delta}$ such that $\hat{\Delta}^{(h)}(x'|x,a)\equiv\hat{P}^{(h)}(x'|x,a)-\hat{P}^{(h)}(x'|x,\pi_{b})$ for all $(x,a,x',h)$, and a point-wise estimation uncertainty bonus $b=(b^{(h)})_{h=1}^{H}$. No additional data is required. The algorithm uses these inputs to construct a policy $\pi=(\pi^{(h)})_{h=1}^{H}$ iteratively. That is, SPVI constructs $\pi^{(h)}$ while iterating over steps $h=H$ to $h=1$.
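As an illustration of the induced shift model, the sketch below builds $\hat{\Delta}^{(h)}$ for a single step of a tabular MDP. It assumes that $\hat{P}^{(h)}(x'|x,\pi_b)$ is obtained by averaging $\hat{P}^{(h)}(x'|x,a)$ over the behavioral action probabilities; the text does not spell this out, so treat it as one plausible reading. The arrays `p_hat` and `pi_b` are hypothetical.

```python
import numpy as np

def induced_shift(p_hat, pi_b):
    """
    p_hat: array (X, A, X) with p_hat[x, a, y] = estimated P(y | x, a) at one step.
    pi_b:  array (X, A) with pi_b[x, a] = behavioral probability of action a at x.
    Returns delta_hat of shape (X, A, X).
    """
    # Assumed form: P_hat(y | x, pi_b) = sum_a pi_b(a | x) * P_hat(y | x, a).
    p_behavior = np.einsum("xa,xay->xy", pi_b, p_hat)
    return p_hat - p_behavior[:, None, :]

# State 0 is bandit-like (both actions share one transition);
# state 1 is dynamic (the two actions lead to different next states).
p_hat = np.array([
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
])
pi_b = np.array([[0.5, 0.5],
                 [0.8, 0.2]])
delta_hat = induced_shift(p_hat, pi_b)
```

In this toy example the shift is identically zero at the bandit-like state, so no next-step uncertainty would be propagated there, while the rarely taken action at the dynamic state carries the largest shift.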

At step $h$, SPVI computes: (1) the Q-value estimate for step $h$ (which is well defined since $(\pi^{(h+1)},\dots,\pi^{(H)})$ is fixed) denoted by $\hat{Q}^{(h)}$ (see (17)), (2) the policy at step $h$ denoted by $\pi^{(h)}$ (see (18)), and (3) pessimistic ($\hat{V}_{p}^{(h)}$), optimistic ($\hat{V}_{o}^{(h)}$), and standard ($\hat{V}^{(h)}$) value functions for step $h$ (see (19), (20), and (21)). Steps one and three are fairly standard, as several RL algorithms construct Q-values and pessimistic/standard/optimistic value functions. Hence, for brevity, we only explain step 2 (constructing $\pi^{(h)}$), the crux of our modification.

As we argued earlier, one can view the goal at step $h$ as selecting $\pi^{(h)}$ in order to maximize $\alpha^{(h)}_{\pi}$. We construct a tight lower bound for $\alpha^{(h)}_{\pi}$ and argue that step 2 maximizes this bound. Similar to standard pessimistic value estimation, we let the bonus $b^{(h)}(x,a)$ bound the total model estimation errors at $(x,a,h)$. Now, from the analysis in Theorem 4.1, we have that (22) is a valid lower bound on $\alpha^{(h)}_{\pi}$.

$$\begin{aligned}\alpha^{(h)}_{\pi}&=\mathbb{E}_{D(\pi_{b})}\big[V^{(h)}_{\tilde{\pi}_{h}}(x^{(h)})-V^{(h)}_{\tilde{\pi}_{h+1}}(x^{(h)})\big]\\&\geq\mathbb{E}_{D(\pi_{b})}\bigg[\hat{Q}^{(h)}(x^{(h)},\pi^{(h)})-\hat{Q}^{(h)}(x^{(h)},\pi_{b}^{(h)})\\&\qquad-b^{(h)}(x^{(h)},\pi^{(h)})-\int_{x'}|\hat{\Delta}^{(h)}(x'|x^{(h)},\pi)|\,\Gamma^{(h+1)}(x')\bigg]\\&=\mathbb{E}_{D(\pi_{b})}\big[\hat{Q}_{\text{sp}}^{(h)}(x^{(h)},\pi^{(h)})-\hat{Q}^{(h)}(x^{(h)},\pi_{b}^{(h)})\big]\end{aligned} \tag{22}$$

Here $\Gamma^{(h+1)}(x)$ is a shorthand for $\hat{V}_{o}^{(h+1)}(x)-\hat{V}_{p}^{(h+1)}(x)$; and for any $g\in\{\hat{Q}^{(h)},\hat{Q}_{\text{sp}}^{(h)},b^{(h)}\}$ we let $g(x,\pi^{(h)})$ be a shorthand for $g(x,\pi^{(h)}(x))$. Now from (18), since $\pi^{(h)}(x)\equiv\operatorname*{arg\,max}_{a}\hat{Q}_{\text{sp}}^{(h)}(x,a)$ for all $x$, we have that step 2 (constructing $\pi^{(h)}$) is equivalent to maximizing the lower bound in (22). Importantly, by maximizing this tight lower bound we can do better on easier CB-like instances (while always avoiding model exploitation by penalizing uncertainty in estimating $\alpha^{(h)}_{\pi}$). This completes our justification of Algorithm 1 (SPVI).
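For finite state and action spaces, updates (17) to (21) can be written directly with arrays. The following is a hedged sketch rather than the authors' implementation: the function name `spvi`, the array layouts, and the toy inputs are assumptions made for illustration. The free action `a` in (19) and (20) is read as $\pi^{(h)}(x)$, and $\hat{P}^{(h)}(\cdot|x,\pi_b)$ is assumed to be the $\pi_b$-weighted average of $\hat{P}^{(h)}(\cdot|x,a)$. Steps are 0-indexed in code, so index `H` plays the role of step $H+1$.

```python
import numpy as np

def spvi(r_hat, p_hat, pi_b, bonus, v_max):
    """
    r_hat: (H, X, A) rewards;  p_hat: (H, X, A, X) transitions;
    pi_b:  (H, X, A) behavioral action probabilities;
    bonus: (H, X, A) uncertainty bonuses;  v_max: (H,) value caps.
    Returns (policy, v_p, v_o, v) with policy of shape (H, X).
    """
    H, X, A = r_hat.shape
    v = np.zeros((H + 1, X))    # standard value estimates
    v_p = np.zeros((H + 1, X))  # pessimistic value estimates
    v_o = np.zeros((H + 1, X))  # optimistic value estimates
    policy = np.zeros((H, X), dtype=int)
    rows = np.arange(X)

    for h in range(H - 1, -1, -1):
        # (17): Q-value estimate from the standard next-step values.
        q = r_hat[h] + p_hat[h] @ v[h + 1]
        # Induced shift relative to pi_b (assumed averaging form).
        p_b = np.einsum("xa,xay->xy", pi_b[h], p_hat[h])
        delta = p_hat[h] - p_b[:, None, :]
        # (18): penalize by the bonus and by shift-weighted next-step uncertainty.
        gamma = v_o[h + 1] - v_p[h + 1]
        q_sp = q - bonus[h] - np.abs(delta) @ gamma
        policy[h] = np.argmax(q_sp, axis=1)
        a = policy[h]
        r_pi, p_pi, b_pi = r_hat[h][rows, a], p_hat[h][rows, a], bonus[h][rows, a]
        # (19) and (20): clipped pessimistic and optimistic values of the chosen action.
        v_p[h] = np.maximum(0.0, r_pi + p_pi @ v_p[h + 1] - b_pi)
        v_o[h] = np.minimum(v_max[h], r_pi + p_pi @ v_o[h + 1] + b_pi)
        # (21): keep the standard estimate between the two.
        v[h] = np.minimum(v_o[h], np.maximum(v_p[h], q[rows, a]))
    return policy, v_p, v_o, v

# Hypothetical bandit-like instance: actions never change the next-state distribution.
H, X, A = 2, 2, 2
r_hat = np.zeros((H, X, A))
r_hat[..., 0] = 0.2
r_hat[..., 1] = 0.7
p_hat = np.full((H, X, A, X), 0.5)
pi_b = np.full((H, X, A), 0.5)
bonus = np.full((H, X, A), 0.1)
v_max = np.array([2.0, 1.0])
policy, v_p, v_o, v = spvi(r_hat, p_hat, pi_b, bonus, v_max)
```

Because the shift is zero everywhere in this toy instance, the selective penalty reduces to the step-$h$ bonus alone, which is the contextual-bandit-like behavior the section argues for.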

6 Simulation


Figure 1: ChainBandit MDP with horizon/length of 4 (this is an adjustable environment parameter). The environment has two chains, a top chain and a bottom chain. The environment also has three actions given by $a_1,a_2,a_3$. The top chain states are the most rewarding. The agent starts at (1,0). At any state in the bottom chain, all the actions lead to the same transition (which is to move to the next state in the bottom chain), so these are essentially bandit states. In the top chain, both $a_1$ and $a_2$ lead to the same transitions (which is to move to the next state in the top chain), and $a_3$ makes the agent move to the next state in the bottom chain. In the top chain, the highest cumulative reward comes from never taking action $a_3$; however, the highest immediate reward comes from selecting the action $a_3$ (which makes planning beneficial in this environment). Note that at every state, action $a_3$ is a sub-optimal action.

To illustrate our insights, we consider a simple toy environment called "ChainBandit". As described in Figure 1, this environment is constructed to have both dynamic (some actions result in a different next-state distribution) and bandit-like (non-dynamic) elements. This ensures that, while planning and some estimate/uncertainty propagation are necessary, complete uncertainty propagation can be unnecessary to evaluate policies of interest. Throughout this section, we consider ChainBandit with a chain length of 3 and the following behavioral policy ($\pi_b$): at every state and step, $\pi_b$ selects action $a_3$ with probability $0.8$ and each of the other two actions with probability $0.1$. From the data collected, we evaluate standard and selective uncertainty propagation on two tasks: (1) estimating upper/lower bounds for $\alpha^{(h)}_{\pi}$; and (2) offline policy learning.

Since uncertainty propagation is the focus of this simulation, to make a fair comparison, both standard and selective propagation (i) use the same tabular approach to estimate a (step-independent) reward/transition model, and (ii) use the same (step-independent) count-based bonuses to account for model estimation errors (see footnote 6).

Footnote 6: The model estimates and bonuses are not step dependent, since the reward/transition functions in ChainBandit are the same for all steps. Further, a tabular approach to reward (transition) estimation simply means using the average reward (average one-hot next-state vector) observed at any state-action pair as its reward (transition) estimate. Here the bonus is given by $b(x,a)=\sqrt{\ln(|\mathcal{X}|\cdot|\mathcal{A}|\cdot H/\delta)/n(x,a)}$, where $n(x,a)$ is the number of times action $a$ was taken at state $x$, and the confidence parameter is $\delta=0.05$.
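The tabular estimates and the count-based bonus can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `fit_tabular_model` and the `(x, a, r, x_next)` tuple layout are hypothetical, and the sketch returns entries only for visited state-action pairs, since the text does not say how pairs with $n(x,a)=0$ are handled.

```python
import math
from collections import defaultdict


def fit_tabular_model(transitions, num_states, num_actions, horizon, delta=0.05):
    """Step-independent tabular estimates from (x, a, r, x_next) tuples.

    Returns three dicts keyed by (x, a): average reward, empirical
    next-state distribution, and bonus sqrt(ln(|X||A|H/delta) / n(x, a)).
    """
    counts = defaultdict(int)
    reward_sum = defaultdict(float)
    next_counts = defaultdict(lambda: defaultdict(int))
    for x, a, r, x_next in transitions:
        counts[(x, a)] += 1
        reward_sum[(x, a)] += r
        next_counts[(x, a)][x_next] += 1

    log_term = math.log(num_states * num_actions * horizon / delta)
    r_hat, p_hat, bonus = {}, {}, {}
    for key, n in counts.items():
        r_hat[key] = reward_sum[key] / n  # average observed reward
        # Average one-hot next-state vector, i.e. empirical frequencies.
        p_hat[key] = {x2: c / n for x2, c in next_counts[key].items()}
        bonus[key] = math.sqrt(log_term / n)
    return r_hat, p_hat, bonus
```

The bonus shrinks at rate $1/\sqrt{n(x,a)}$, so actions that $\pi_b$ rarely takes (here $a_1$ and $a_2$) carry larger bonuses than $a_3$.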

For the step $h$ and policy $\pi$ of interest, standard CIs for $\alpha^{(h)}_{\pi}$ are constructed using pessimistic/optimistic value estimates at step $h$ for policies $\tilde{\pi}_{h}$ and $\tilde{\pi}_{h+1}$, i.e., utilizing (23).

$$
\begin{aligned}
\mathbb{E}_{D(\pi_b)}\big[\hat{V}^{(h)}_{\tilde{\pi}_{h},o}(x^{(h)}) - \hat{V}^{(h)}_{\tilde{\pi}_{h+1},p}(x^{(h)})\big] &\geq \alpha^{(h)}_{\pi} \\
&\geq \mathbb{E}_{D(\pi_b)}\big[\hat{V}^{(h)}_{\tilde{\pi}_{h},p}(x^{(h)}) - \hat{V}^{(h)}_{\tilde{\pi}_{h+1},o}(x^{(h)})\big]
\end{aligned}
\tag{23}
$$

For selective CIs, we use (22) to construct the lower bound and construct the upper bound similarly. Note that converting inequalities like (22) and (23) into empirical bounds is straightforward since our training data is drawn from $D(\pi_b)$. Figure 2 plots CIs for both selective and standard uncertainty propagation as we vary the evaluation policy. As expected, the benefits over standard pessimism are larger when the next-state distribution shift is smaller, that is, when the evaluation policy is closer to the behavioral policy.
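One plausible way to turn (23) into an empirical bound is to replace the expectation over $D(\pi_b)$ with a sample mean over the step-$h$ states in the dataset. The sketch below does exactly that; the function name `standard_ci` and the dictionary inputs are hypothetical, and the sketch assumes the optimistic/pessimistic value estimates have already been computed. It does not add any correction for the sampling error of the mean itself.

```python
def standard_ci(states_at_h, v_opt_h, v_pes_h, v_opt_h1, v_pes_h1):
    """Empirical version of (23) for the standard CI on alpha_pi^(h).

    states_at_h: list of states x^(h) observed at step h in the dataset.
    v_opt_h, v_pes_h: dicts mapping a state to the optimistic/pessimistic
        step-h value estimate for the policy pi~_h.
    v_opt_h1, v_pes_h1: the same for the policy pi~_{h+1}.
    Returns (lower, upper).
    """
    n = len(states_at_h)
    # Upper bound: optimistic value of pi~_h minus pessimistic value of pi~_{h+1}.
    upper = sum(v_opt_h[x] - v_pes_h1[x] for x in states_at_h) / n
    # Lower bound: pessimistic value of pi~_h minus optimistic value of pi~_{h+1}.
    lower = sum(v_pes_h[x] - v_opt_h1[x] for x in states_at_h) / n
    return lower, upper
```

Both value estimates contribute their full propagated uncertainty to the interval width, which is the behavior that selective propagation aims to avoid when the next-state distribution shift is small.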

In Figure 3, we plot the value of the learned policy for various algorithms as we vary the number of training episodes. In particular, we compare SPVI, PVI [Jin et al., 2021], and pessimistic supervised learning (PSL). Here PSL refers to pessimistic bandit policy optimization applied to each step without planning. The ChainBandit environment benefits from planning, so PSL performs poorly, as expected.

On all ChainBandit simulations we tried, SPVI was by far the best-performing algorithm. We considered the behavioral policy described earlier because it is more disadvantageous for SPVI. In particular, since selective pessimism has an initial bias against policies that lead to significant shifts, we chose a highly sub-optimal behavioral policy (selecting $a_3$ with probability $0.8$). While this leads to a worse start for SPVI than for PVI, SPVI eventually outperforms the other algorithms. We also run simulations for CI construction and policy learning on the standard GridWorld (see Appendix B); since this is a very dynamic environment, standard and selective propagation have similar performance (see footnote 7).

Footnote 7: All algorithm runs take less than 2 minutes on a MacBook Pro M2 16GB.


Figure 2: We plot CIs for $\alpha^{(2)}_{\pi}$ while varying the evaluation policy. These evaluation policies are parameterized by $\lambda\in[0,1]$. For all states/steps, the probabilities of selecting $a_1$, $a_2$, and $a_3$ are $(1-\lambda)/2$, $(1-\lambda)/2$, and $\lambda$, respectively. Note that the evaluation policy is the same as the behavioral policy for $\lambda=0.8$. The number of training episodes is 10000, and the plots are averaged over 10 runs.
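The $\lambda$-parameterized evaluation policies in Figure 2 can be written down directly. In the sketch below, the function names are hypothetical, and the total variation distance is only one possible way (an assumption, not taken from the paper) to quantify how close an evaluation policy is to the behavioral policy.

```python
def evaluation_policy(lam):
    """Action probabilities ((1 - lam)/2, (1 - lam)/2, lam) for a1, a2, a3."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return {"a1": (1.0 - lam) / 2.0, "a2": (1.0 - lam) / 2.0, "a3": lam}


def tv_distance(p, q):
    """Total variation distance between two action distributions."""
    return 0.5 * sum(abs(p[a] - q[a]) for a in p)


# Behavioral policy from this section: a3 with probability 0.8.
BEHAVIOR = {"a1": 0.1, "a2": 0.1, "a3": 0.8}
```

Under this measure, the distance between `evaluation_policy(lam)` and `BEHAVIOR` equals $|\lambda - 0.8|$, so it vanishes at $\lambda = 0.8$, where the two policies coincide.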


Figure 3: Policy learning with a bad behavioral policy

Conclusion: We introduce selective propagation, a general approach to interpolate between CB and RL techniques, achieving guarantees that adapt to instance hardness. Further developing this approach could impact real-world problems (e.g., recommendation systems, mHealth, EdTech) that lie in between the two settings.

References

  • Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pages 2312–2320, 2011.
  • Agarwal et al. [2014] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning, pages 1638–1646, 2014.
  • Agrawal and Goyal [2012] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceeding of the 25th Annual Conference on Learning Theory, pages 39.1–39.26, 2012.
  • Agrawal and Goyal [2013] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pages 127–135, 2013.
  • Bellemare et al. [2016] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. Advances in neural information processing systems, 29, 2016.
  • Bobu et al. [2018] Andreea Bobu, Eric Tzeng, Judy Hoffman, and Trevor Darrell. Adapting to continuously shifting domains. workshop, 2018.
  • Carranza et al. [2023] Aldo Gael Carranza, Sanath Kumar Krishnamurthy, and Susan Athey. Flexible and efficient contextual bandits with heterogeneous treatment effect oracles. In International Conference on Artificial Intelligence and Statistics, pages 7190–7212. PMLR, 2023.
  • Chapelle and Li [2012] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249–2257, 2012.
  • Dani et al. [2008] Varsha Dani, Thomas Hayes, and Sham Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pages 355–366, 2008.
  • Dudik et al. [2014] Miroslav Dudik, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
  • Farshchian et al. [2018] Ali Farshchian, Juan A Gallego, Joseph P Cohen, Yoshua Bengio, Lee E Miller, and Sara A Solla. Adversarial domain adaptation for stable brain-machine interfaces. arXiv preprint arXiv:1810.00045, 2018.
  • Foster and Rakhlin [2023] Dylan J Foster and Alexander Rakhlin. Foundations of reinforcement learning and interactive decision making. arXiv preprint arXiv:2312.16730, 2023.
  • Foster and Syrgkanis [2019] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
  • Foster et al. [2021] Dylan J Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919, 2021.
  • Garivier and Cappe [2011] Aurelien Garivier and Olivier Cappe. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceeding of the 24th Annual Conference on Learning Theory, pages 359–376, 2011.
  • Jin et al. [2021] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
  • Krishnamurthy et al. [2018] Akshay Krishnamurthy, Zhiwei Steven Wu, and Vasilis Syrgkanis. Semiparametric contextual bandits. In International Conference on Machine Learning, pages 2776–2785. PMLR, 2018.
  • Künzel et al. [2019] Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10):4156–4165, 2019.
  • Langford and Zhang [2008] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems 20, pages 817–824, 2008.
  • Lattimore and Szepesvari [2019] Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. Cambridge University Press, 2019.
  • Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020.
  • Li et al. [2010] Lihong Li, Wei Chu, John Langford, and Robert Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 2010.
  • Martin et al. [2017] Jarryd Martin, Suraj Narayanan Sasikumar, Tom Everitt, and Marcus Hutter. Count-based exploration in feature space for reinforcement learning. arXiv preprint arXiv:1706.08090, 2017.
  • Moerland et al. [2023] Thomas M Moerland, Joost Broekens, Aske Plaat, Catholijn M Jonker, et al. Model-based reinforcement learning: A survey. Foundations and Trends® in Machine Learning, 16(1):1–118, 2023.
  • Nie and Wager [2021] Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.
  • Osband et al. [2021] Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, and Benjamin Van Roy. Epistemic neural networks. arXiv preprint arXiv:2107.08924, 2021.
  • Russo et al. [2018] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.
  • Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Sutton [1988] Richard Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
  • Sutton and Barto [1998] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
  • Sutton et al. [2000] Richard Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063, 2000.
  • Syrgkanis and Zhan [2023] Vasilis Syrgkanis and Ruohan Zhan. Post-episodic reinforcement learning inference. arXiv preprint arXiv:2302.08854, 2023.
  • Thompson [1933] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
  • Vergara et al. [2012] Alexander Vergara, Shankar Vembu, Tuba Ayhan, Margaret A Ryan, Margie L Homer, and Ramón Huerta. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical, 166:320–329, 2012.
  • Wang et al. [2019] Yining Wang, Ruosong Wang, Simon S Du, and Akshay Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. arXiv preprint arXiv:1912.04136, 2019.
  • Williams [1992] Ronald Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
  • Yin and Wang [2021] Ming Yin and Yu-Xiang Wang. Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34:4065–4078, 2021.
  • Zanette and Brunskill [2019] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.
  • Zhang et al. [2022] Kelly W Zhang, Omer Gottesman, and Finale Doshi-Velez. A bayesian approach to learning bandit structure in markov decision processes. arXiv preprint arXiv:2208.00250, 2022.

Selective Uncertainty Propagation in Offline RL
(Supplementary Material)

Appendix A Proof for Section 4

In this section, we prove our main theoretical result, Theorem 4.1; see Section 4 for its statement.

We now begin the proof of Theorem 4.1. We start by proving Lemma A.1, which shows that $\tilde{\alpha}^{(h)}_{\pi}$ is close to our target estimand. Here $\tilde{\alpha}^{(h)}_{\pi}$ is defined in (24) and can be viewed as a less empirical version of $\hat{\alpha}^{(h)}_{\pi}$ (our treatment effect estimator at step $h$).

$$
\tilde{\alpha}^{(h)}_{\pi} = \hat{\theta}^{(h)}_{\pi} + \mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} \hat{V}_{\pi}^{(h+1)}(x')\,\hat{\Delta}^{(h)}(x'|x^{(h)},\pi)\bigg]
\tag{24}
$$
Lemma A.1.

Under the conditions of Theorem 4.1, the following bound holds with probability $1-\delta_{\text{in}}$:

$$
|\tilde{\alpha}^{(h)}_{\pi} - \alpha^{(h)}_{\pi}| \leq \tilde{L}_{\pi}^{(h)},
\tag{25}
$$

where $\tilde{L}_{\pi}^{(h)}$ is given by (26).

$$
\tilde{L}_{\pi}^{(h)} = \kappa^{(h)}_{\pi,\theta} + V_{\max}^{(h+1)}\,\kappa^{(h)}_{\pi,\Delta} + \mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} \big|\mathbb{E}_{D(\pi_b)}[\hat{\Delta}^{(h)}(x'|x^{(h)},\pi)]\big| \cdot \Gamma_{\pi}^{(h+1)}(x')\bigg]
\tag{26}
$$
Proof.

We start by simplifying $\alpha^{(h)}_{\pi}$, defined in (4).

$$
\begin{aligned}
\alpha^{(h)}_{\pi} &= \mathbb{E}_{D(\pi_b)}\big[V^{(h)}_{\tilde{\pi}_{h}}(x^{(h)}) - V^{(h)}_{\tilde{\pi}_{h+1}}(x^{(h)})\big] \\
&\stackrel{(i)}{=} \mathbb{E}_{D(\pi_b)}\bigg[\bigg(R^{(h)}(x,\tilde{\pi}_{h}) + \int_{x'} V_{\pi}^{(h+1)}(x')\,P^{(h)}(x'|x,\tilde{\pi}_{h})\bigg) - \bigg(R^{(h)}(x,\tilde{\pi}_{h+1}) + \int_{x'} V_{\pi}^{(h+1)}(x')\,P^{(h)}(x'|x,\tilde{\pi}_{h+1})\bigg)\bigg] \\
&\stackrel{(ii)}{=} \mathbb{E}_{D(\pi_b)}\bigg[\Big(R^{(h)}(x,\tilde{\pi}_{h}) - R^{(h)}(x,\tilde{\pi}_{h+1})\Big) + \bigg(\int_{x'} V_{\pi}^{(h+1)}(x')\Big(P^{(h)}(x'|x,\tilde{\pi}_{h}) - P^{(h)}(x'|x,\tilde{\pi}_{h+1})\Big)\bigg)\bigg] \\
&\stackrel{(iii)}{=} \theta^{(h)}_{\pi} + \mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} V_{\pi}^{(h+1)}(x')\,\Delta^{(h)}(x'|x,\pi)\bigg]
\end{aligned}
\tag{27}
$$

Here (i) follows from (2), (ii) follows from re-arranging terms, and (iii) follows from (7) and (5).
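The decomposition in (27) can be sanity-checked numerically on a small random tabular instance. The sketch below rests on assumptions about definitions that lie outside this section: it takes $\tilde{\pi}_{h}$ to act according to $\pi$ at step $h$ and $\tilde{\pi}_{h+1}$ to act according to $\pi_b$ at step $h$ (both followed by $\pi$ afterwards), $\theta^{(h)}_{\pi}$ to be the expected one-step reward difference, and $\Delta^{(h)}$ to be the difference of the induced next-state distributions. All names in the code are hypothetical.

```python
import random


def random_dist(n, rng):
    """Random probability vector of length n."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def check_decomposition(num_states=4, num_actions=3, seed=0):
    """Return both sides of (27) on a random one-step tabular instance."""
    rng = random.Random(seed)
    S, A = range(num_states), range(num_actions)
    R = [[rng.random() for _ in A] for _ in S]                   # R^(h)(x, a)
    P = [[random_dist(num_states, rng) for _ in A] for _ in S]   # P^(h)(. | x, a)
    V_next = [rng.random() for _ in S]                           # V_pi^(h+1)(x')
    pi = [random_dist(num_actions, rng) for _ in S]              # evaluation policy at step h
    pi_b = [random_dist(num_actions, rng) for _ in S]            # behavioral policy at step h
    d = random_dist(num_states, rng)                             # law of x^(h) under D(pi_b)

    def reward(x, policy):
        return sum(policy[x][a] * R[x][a] for a in A)

    def trans(x2, x, policy):
        return sum(policy[x][a] * P[x][a][x2] for a in A)

    def q_value(x, policy):
        return reward(x, policy) + sum(V_next[x2] * trans(x2, x, policy) for x2 in S)

    # Left-hand side: expected value difference, as in the first line of (27).
    lhs = sum(d[x] * (q_value(x, pi) - q_value(x, pi_b)) for x in S)
    # Right-hand side: reward effect theta plus the next-state shift term.
    theta = sum(d[x] * (reward(x, pi) - reward(x, pi_b)) for x in S)
    shift = sum(
        d[x] * sum(V_next[x2] * (trans(x2, x, pi) - trans(x2, x, pi_b)) for x2 in S)
        for x in S
    )
    return lhs, theta + shift
```

The two returned numbers agree up to floating-point error for any seed. When $\pi$ and $\pi_b$ induce the same next-state distribution (a bandit-like step), the shift term is zero and only $\theta^{(h)}_{\pi}$ remains.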

With probability $1-\delta_{\text{in}}$, the input guarantees of Section 4 hold. Hence, utilizing these guarantees, we can bound the distance between $\tilde{\alpha}^{(h)}_{\pi}$ and the target estimand $\alpha^{(h)}_{\pi}$.

$$
\begin{aligned}
&|\tilde{\alpha}^{(h)}_{\pi} - \alpha^{(h)}_{\pi}| \\
&\stackrel{(i)}{=} \bigg|\bigg(\hat{\theta}^{(h)}_{\pi} + \mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} \hat{V}_{\pi}^{(h+1)}(x')\,\hat{\Delta}^{(h)}(x'|x^{(h)},\pi)\bigg]\bigg) - \bigg(\theta^{(h)}_{\pi} + \mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} V_{\pi}^{(h+1)}(x')\,\Delta^{(h)}(x'|x,\pi)\bigg]\bigg)\bigg| \\
&\stackrel{(ii)}{\leq} \kappa^{(h)}_{\pi,\theta} + \bigg|\mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} \bigg(\hat{V}_{\pi}^{(h+1)}(x')\,\hat{\Delta}^{(h)}(x'|x^{(h)},\pi) - V_{\pi}^{(h+1)}(x')\,\Delta^{(h)}(x'|x^{(h)},\pi)\bigg)\bigg]\bigg| \\
&\stackrel{(iii)}{\leq} \kappa^{(h)}_{\pi,\theta} + \bigg|\mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} \Big(\hat{V}_{\pi}^{(h+1)}(x') - V_{\pi}^{(h+1)}(x')\Big)\hat{\Delta}^{(h)}(x'|x^{(h)},\pi)\bigg]\bigg| \\
&\qquad + \bigg|\mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} V_{\pi}^{(h+1)}(x')\Big(\hat{\Delta}^{(h)} - \Delta^{(h)}\Big)(x'|x^{(h)},\pi)\bigg]\bigg| \\
&\stackrel{(iv)}{\leq} \kappa^{(h)}_{\pi,\theta} + \mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} \big|\mathbb{E}_{D(\pi_b)}[\hat{\Delta}^{(h)}(x'|x^{(h)},\pi)]\big| \cdot \Gamma_{\pi}^{(h+1)}(x')\bigg] + V_{\max}^{(h+1)}\,\kappa^{(h)}_{\pi,\Delta} =: \tilde{L}_{\pi}^{(h)}
\end{aligned}
\tag{28}
$$

Here (i) follows from (24) and (27); (ii) follows from the triangle inequality and the input guarantee; (iii) follows from the triangle inequality; and (iv) follows from the Cauchy-Schwarz inequality and the input guarantees. ∎

Having shown in Lemma A.1 that $\tilde{\alpha}^{(h)}_{\pi}$ is close to our target estimand, we need only two more facts to complete the proof of Theorem 4.1: (i) $\tilde{\alpha}^{(h)}_{\pi}$ is close to our treatment effect estimator $\hat{\alpha}^{(h)}_{\pi}$ (defined in (13)), and (ii) $\tilde{L}_{\pi}^{(h)}$ is sufficiently smaller than $L_{\pi}^{(h)}$. We show that both statements hold with high probability.

We start by showing that $\tilde{\alpha}^{(h)}_{\pi}$ is close to $\hat{\alpha}^{(h)}_{\pi}$ with high probability. In particular, (29) holds with probability at least $1-\delta/2$.

\begin{equation}
\begin{aligned}
|\tilde{\alpha}^{(h)}_{\pi} - \hat{\alpha}^{(h)}_{\pi}|
&= \bigg|\mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} \hat{V}_{\pi}^{(h+1)}(x')\,\hat{\Delta}^{(h)}(x'|x^{(h)},\pi)\bigg] - \frac{1}{T}\sum_{t=1}^{T}\int_{x'} \hat{V}_{\pi}^{(h+1)}(x')\,\hat{\Delta}^{(h)}(x'|x^{(h)}_{t},\pi)\bigg| \\
&\leq 2 V_{\max}^{(h+1)}\sqrt{\frac{\ln(4/\delta)}{2T}}.
\end{aligned}
\tag{29}
\end{equation}

In (29), the first equality follows from (13) and (24). The first inequality follows from Hoeffding’s inequality and the fact that $\int_{x'} \hat{V}_{\pi}^{(h+1)}(x')\,\hat{\Delta}^{(h)}(x'|x^{(h)}_{t},\pi) \in [-V_{\max}^{(h+1)}, V_{\max}^{(h+1)}]$.
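As a numerical illustration of the Hoeffding term on the right-hand side of (29), the following minimal sketch evaluates $2V_{\max}^{(h+1)}\sqrt{\ln(4/\delta)/(2T)}$ for given inputs. The function name `hoeffding_radius` and the example values are assumptions made for illustration; they are not taken from the paper.

```python
import math


def hoeffding_radius(v_max: float, num_episodes: int, delta: float) -> float:
    """Two-sided Hoeffding deviation for the mean of `num_episodes` i.i.d.
    terms bounded in [-v_max, v_max], valid with probability >= 1 - delta/2.

    This is the quantity 2 * v_max * sqrt(ln(4/delta) / (2T)) from (29).
    """
    if v_max < 0:
        raise ValueError("v_max must be non-negative")
    if num_episodes <= 0:
        raise ValueError("num_episodes must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return 2.0 * v_max * math.sqrt(math.log(4.0 / delta) / (2.0 * num_episodes))


if __name__ == "__main__":
    # Hypothetical inputs: rewards bounded by 1, failure probability 0.05.
    for T in (500, 2000, 8000):
        print(T, hoeffding_radius(v_max=1.0, num_episodes=T, delta=0.05))
```

The radius shrinks at rate $1/\sqrt{T}$, so quadrupling the number of episodes halves it.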

From Hoeffding’s inequality and the fact that (16) holds, (30) also holds with probability at least $1-\delta/2$.

\begin{equation}
\begin{aligned}
&\mathbb{E}_{D(\pi_b)}\bigg[\int_{x'} \big|\mathbb{E}_{D(\pi_b)}[\hat{\Delta}^{(h)}(x'|x^{(h)},\pi)]\big| \cdot \Gamma_{\pi}^{(h+1)}(x')\bigg] - \frac{1}{T}\sum_{t=1}^{T}\int_{x'} \big|\mathbb{E}_{D(\pi_b)}[\hat{\Delta}^{(h)}(x'|x^{(h)}_{t},\pi)]\big| \cdot \Gamma_{\pi}^{(h+1)}(x') \\
&\qquad \leq 4 V_{\max}^{(h+1)}\sqrt{\frac{\ln(4/\delta)}{2T}}.
\end{aligned}
\tag{30}
\end{equation}

Note that (30) implies that $\tilde{L}^{(h)}_{\pi} + 2V_{\max}^{(h+1)}\sqrt{\frac{\ln(4/\delta)}{2T}}$ is no larger than $L^{(h)}_{\pi}$.

Hence, with probability at least $1-\delta_{\text{in}}-\delta$, (25), (29), and (30) all hold. We now use these bounds to show that (31) holds.

\begin{equation}
\begin{aligned}
|\alpha^{(h)}_{\pi} - \hat{\alpha}^{(h)}_{\pi}|
&\stackrel{(i)}{\leq} |\alpha^{(h)}_{\pi} - \tilde{\alpha}^{(h)}_{\pi}| + |\tilde{\alpha}^{(h)}_{\pi} - \hat{\alpha}^{(h)}_{\pi}| \\
&\stackrel{(ii)}{\leq} \tilde{L}_{\pi}^{(h)} + 2V_{\max}^{(h+1)}\sqrt{\frac{\ln(4/\delta)}{2T}} \\
&\stackrel{(iii)}{\leq} L_{\pi}^{(h)}.
\end{aligned}
\tag{31}
\end{equation}

Here (i) follows from the triangle inequality; (ii) follows from (25) and (29); and (iii) follows from (30). Therefore, we have shown that (31) holds with probability at least $1-\delta_{\text{in}}-\delta$. This completes the proof of Theorem 4.1.

Appendix B Additional Simulation

Similar to Section 6, we now run simulations for the standard GridWorld environment (with width $8$ and height $3$). Here states are discrete points on a bounded two-dimensional grid. The agent’s starting state is sampled uniformly at random from the grid, and the agent should learn to reach a specified goal state (which is an absorbing state). Upon transitioning to the goal state, the agent receives a reward of one; otherwise it receives a reward of zero. The agent has 4 actions (left, right, up, and down). These actions make the agent move one step in that direction if possible. If the agent is at the boundary of the grid and cannot move in the selected direction, it stays in the same state. Since we make the goal state an absorbing state, all actions at this state leave the agent in this state. We set the starting state for the GridWorld as (1,1), the terminal state as (2,2), and the horizon as 3. For all states/steps, our behavioral policy samples actions (left, right, up, and down) with the probabilities $(0.20, 0.10, 0.50, 0.20)$. Other choices for GridWorld environment parameters and behavioral policy appear to generate similar plots. All our plots in this section are averaged over five simulation runs.
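The environment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: coordinates are taken to be 0-indexed $(x, y)$ pairs with "up" increasing $y$, a reward of one is given only on the step that first enters the goal, and all names (`step`, `sample_episode`, and the constants) are hypothetical.

```python
import random

WIDTH, HEIGHT = 8, 3                 # grid size from the text
START, GOAL = (1, 1), (2, 2)         # start and terminal states from the text
HORIZON = 3
ACTIONS = ("left", "right", "up", "down")
BEHAVIOR_PROBS = (0.20, 0.10, 0.50, 0.20)  # behavioral policy, same action order

# Assumption: 0-indexed (x, y) coordinates, with "up" increasing y.
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def step(state, action):
    """Deterministic transition; returns (next_state, reward)."""
    if state == GOAL:
        # Absorbing goal: every action keeps the agent here.
        # Assumption: no further reward is collected while staying at the goal.
        return GOAL, 0.0
    dx, dy = MOVES[action]
    x, y = state[0] + dx, state[1] + dy
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
        x, y = state  # blocked by the boundary: stay in place
    next_state = (x, y)
    return next_state, (1.0 if next_state == GOAL else 0.0)


def sample_episode(policy_probs, rng, start=START):
    """Roll out one episode of length HORIZON under a state-independent policy."""
    state, episode = start, []
    for _ in range(HORIZON):
        action = rng.choices(ACTIONS, weights=policy_probs, k=1)[0]
        next_state, reward = step(state, action)
        episode.append((state, action, reward, next_state))
        state = next_state
    return episode


if __name__ == "__main__":
    rng = random.Random(0)
    dataset = [sample_episode(BEHAVIOR_PROBS, rng) for _ in range(5)]
    for ep in dataset:
        print(ep)
```

Under these conventions the goal is reachable from the start in two moves (right, then up), so a horizon of 3 leaves one step of slack.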

Figure 4: Plotting $\alpha^{(2)}_{\pi}$ while varying the evaluation policy $\pi$. Here $\lambda$ (the evaluation policy’s probability of taking the down action) corresponds to the X-axis.

We constructed intervals for $\alpha^{(2)}_{\pi}$, the treatment effect at step $2$, using both selective and standard uncertainty propagation. In Figure 4, we plot the CIs for $\alpha^{(2)}_{\pi}$ as we vary the evaluation policy, with $\lambda$ parameterizing the evaluation policies. For all states/steps, the evaluation policy corresponding to $\lambda$ samples actions (left, right, up, and down) with the probabilities $(0.25, 0.20, 0.55-\lambda, \lambda)$. Our CIs are constructed from a sampled dataset of $2000$ episodes. In Figure 4, we see that selective and standard uncertainty propagation perform similarly. This is understandable because GridWorld is a dynamic environment (each action has a different next-state distribution), hence estimate/uncertainty propagation is less avoidable here for valid CI construction.
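The family of evaluation policies indexed by $\lambda$ can be written down directly. The sketch below is illustrative only: the function name `evaluation_policy_probs` and the particular grid of $\lambda$ values are assumptions, since the text does not state which values of $\lambda$ are swept in Figure 4.

```python
ACTIONS = ("left", "right", "up", "down")


def evaluation_policy_probs(lam: float):
    """Action probabilities (left, right, up, down) = (0.25, 0.20, 0.55 - lam, lam).

    lam is the probability of the down action; it must lie in [0, 0.55]
    so that the probability of the up action stays non-negative.
    """
    if not 0.0 <= lam <= 0.55:
        raise ValueError("lam must lie in [0, 0.55]")
    return (0.25, 0.20, 0.55 - lam, lam)


if __name__ == "__main__":
    # Hypothetical sweep: lam = 0.00, 0.05, ..., 0.55.
    for k in range(12):
        lam = k / 20
        probs = evaluation_policy_probs(lam)
        print(f"lambda={lam:.2f}", dict(zip(ACTIONS, probs)))
```

Increasing $\lambda$ moves probability mass from the up action to the down action while leaving the left and right probabilities fixed.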

Figure 5: Learning Experiment on GridWorld

We also compare the policy learning algorithms on the same GridWorld environment, with the same behavioral policy. In Figure 5, we plot the value of the policy learnt by SPVI, PVI, and PSL as we vary the number of training episodes. PSL still performs poorly since planning is necessary. However, since GridWorld is a dynamic environment, PVI and SPVI perform similarly.