License: CC BY-NC-ND 4.0
arXiv:2302.00284v2 [cs.LG] 12 Feb 2024
Selective Uncertainty Propagation in Offline RL
Sanath Kumar Krishnamurthy
Stanford University
Shrey Modi
Indian Institute of Technology Bombay
Tanmay Gangwani
Amazon
Sumeet Katariya
Amazon
Branislav Kveton
Amazon
Anshuka Rangi
Amazon
Abstract
We consider the finite-horizon offline reinforcement learning (RL) setting, and are motivated by the challenge of learning the policy at any step h in dynamic programming (DP) algorithms. To learn this, it is sufficient to evaluate the treatment effect of deviating from the behavioral policy at step h after having optimized the policy for all future steps. Since the policy at any step can affect next-state distributions, the related distributional shift challenges can make this problem statistically far harder than estimating such treatment effects in the stochastic contextual bandit setting. However, the hardness of many real-world RL instances lies between the two regimes. We develop a flexible and general method called selective uncertainty propagation for confidence interval construction that adapts to the hardness of the associated distribution shift challenges. We show benefits of our approach on toy environments and demonstrate the benefits of these techniques for offline policy learning.
1 Introduction
We study the finite-horizon offline reinforcement learning (RL) problem, focusing on algorithms that adapt to instance hardness. At a high level, we study algorithms that provide better guarantees for contextual bandit (CB) like instances while being able to plan in more dynamic RL-like instances.
Our work is motivated by real-world RL problems, such as user interaction with an e-commerce search engine (recommendation system). Here, the state can be a user query, and the action is the product recommendation from the engine. When the user wants to buy a particular product, the user often only enters a single product query unrelated to the previous one; thus resembling a sequence of CB problems. On the other hand, when the user explores products, the exploration queries are related through the user’s intent, and the recommendation system may want to steer the user toward the ideal product. Hence, this resembles the RL setting. This indicates the need to develop unified solutions that integrate CB and RL techniques – adapting to instance hardness. We now introduce the CB and RL frameworks in more detail.
Stochastic contextual bandits (CBs) [Langford and Zhang, 2008, Li et al., 2010] and finite-horizon reinforcement learning (RL) [Sutton, 1988, Williams, 1992, Sutton and Barto, 1998] are two fundamental frameworks for decision-making under uncertainty. In stochastic CBs, the environment samples the context and corresponding rewards (for each action) from a fixed but unknown distribution; the agent then observes the context and learns to select the most rewarding action conditioned on the context.
Finite-horizon RL is a generalization of CBs where contexts become states and a sequence of decisions is made over H steps. Similar to the CB problem, at each step, the agent observes the current state, selects an action conditioned on the current state, and receives a reward sampled by the environment from a corresponding conditional distribution. However, unlike the CB problem, while the initial state is sampled from a fixed but unknown distribution, the next state at any step depends on the current state and the agent’s action. Hence, the agent can plan to attain high cumulative reward by learning to reach high-value future states.
Unfortunately, the fact that actions can influence future states implies that the agent needs to learn under state-distribution shifts, making the RL setting statistically much harder than CBs in the worst case. For example, [Foster et al., 2021] show that the worst-case sample complexity to learn a non-trivial offline RL policy is either polynomial in the state space size or exponential in other parameters. (Footnote 1: [Foster et al., 2021] consider the discounted infinite horizon offline RL formulation. However, one should expect similar lower bounds for the finite horizon offline RL formulation.) On the other hand, if actions do not influence next-state distributions at any step, the RL instance would be equivalent to solving stochastic CB instances. On such instances, offline bandit algorithms [Foster and Syrgkanis, 2019] would enjoy a polynomial sample complexity for policy learning with no dependence on state space size. Hence, for such instances, state-of-the-art offline RL algorithms such as pessimistic value function optimization [Jin et al., 2021] may be unnecessarily conservative.
We formalize this dichotomy and show that the statistical hardness of offline RL instances can be captured by the size of actions’ impact on the next state’s distribution. To show this, we consider the high-level structure of dynamic programming (DP) algorithms for offline RL [e.g. Jin et al., 2021]. DP algorithms construct a policy iteratively, starting from the policy for the final step and ending with the policy for the first step. At any step h, DP algorithms can be viewed as selecting the policy at step h that maximizes the treatment effect of deviating from the behavioral policy at step h after having optimized the policy for all future steps. The goal of this paper is to estimate and construct good confidence intervals for this treatment effect at step h.
Our primary focus is on confidence interval (CI) construction, which is motivated by the fact that many successful offline RL algorithms learn a policy that maximizes the lower bound of constructed CIs [Jin et al., 2021]. To account for estimation errors from future steps, standard methods for CI construction at any step h propagate uncertainty from future steps to the current step h. This paper seeks to construct better CIs that adapt to instance hardness by selectively propagating uncertainty.
In cases where all actions have zero estimated impact on next-state distributions, our procedure does not propagate any uncertainty from later steps and still constructs valid CIs for the treatment effect of deviating from the behavioral policy at step h after having optimized the policy for all future steps. It treats the instance like a CB problem, and hence enjoys a polynomial sample complexity with no dependence on state space size for treatment effect estimation. For more dynamic instances, our procedure must unavoidably propagate more uncertainty from future steps in order to continue constructing valid CIs. In this way, we adapt to the hardness of the instance for CI construction at any step. We also show the benefits of this approach for offline policy learning by proposing an algorithm that optimizes our constructed CIs. Simple simulations further support our claim.
Related Work: Both bandits and RL have been studied extensively [Lattimore and Szepesvari, 2019, Sutton and Barto, 1998, Foster and Rakhlin, 2023]. In bandits, the focus has been on achieving higher statistical efficiency by using the reward distribution of actions [Garivier and Cappe, 2011], prior distribution of model parameters [Thompson, 1933, Agrawal and Goyal, 2012, Chapelle and Li, 2012, Russo et al., 2018], parametric structure [Dani et al., 2008, Abbasi-Yadkori et al., 2011, Agrawal and Goyal, 2013], or agnostic methods [Agarwal et al., 2014]. In RL, the focus has been on different means of learning to plan for longer horizons, such as the value function [Sutton, 1988], policy [Williams, 1992], or their combination [Sutton et al., 2000]. Just as in our work, causal inference insights have helped improve the statistical efficiency of both CB and RL algorithms [Krishnamurthy et al., 2018, Carranza et al., 2023, Syrgkanis and Zhan, 2023]. However, bridging the gap between bandits and RL is an exciting and relatively under-explored research direction. One way to define this gap is to argue that in bandit-like environments, the state never changes once initially sampled. These bandit-like environments can be viewed as a special case of the situation where actions do not impact next-state distributions. With bridging this gap as one motivation, [Zanette and Brunskill, 2019, Yin and Wang, 2021] have used variance-dependent Bernstein bounds to limit uncertainty propagation when there is a lack of next-step value function heterogeneity. Another approach is to define this gap in a binary fashion. Either there is no impact of actions on next state distributions, or we are in a dynamic MDP environment. In an online setting, [Zhang et al., 2022] develop hypothesis tests to differentiate between the two situations and then select the most appropriate exploration algorithm. 
While their higher-level framing is similar to ours and their approach is novel, their approach cannot outperform existing RL algorithms in MDP environments. By interpolating between the two regimes, we hope to outperform bandit and existing RL algorithms that either forgo planning or are too conservative in accounting for actions’ impact on next-state distributions.
2 Preliminaries
Setting: We consider an episodic Markov Decision Process (MDP) setting with state space , action space , horizon , and transition kernel . At every episode, the environment samples a starting state and a set of realized rewards from a fixed but unknown distribution . Here is a map from to . For any states and action , denotes the probability density of transitioning to state conditional on taking action at state during step . A trajectory is a sequence of states, actions, and rewards. That is, any trajectory is given by .
A policy is a sequence of action sampling kernels , where denotes the probability of sampling action at state during step under the policy . We let denote the induced distribution over trajectories under the policy . For any policy , we define the (state-) value function at each step such that,
(1)
The value of policy is given by . We can take expectation over instead of here since the only random variable in is the initial state which does not depend on the choice of the policy .
For any step , we let be a function from to denoting the expected reward function for step . That is, . With some abuse of notation, for any , we let and . That is, is the expected reward at state and step under the policy . Similarly, is the expected transition probability from to at step under the policy . For any step , we also let denote a bound on the maximum value can take for any state and policy .
It is also equivalent to define the value functions () using the iterative definition in (2), where .
(2)
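The iterative definition referenced in (2) can be sketched as a backward recursion on a small tabular MDP. This is a minimal illustration rather than the paper's notation: it assumes step-independent expected rewards and transitions stored as nested lists, and the names `policy_values`, `R`, `P`, and `pi` are hypothetical.

```python
from typing import List


def policy_values(R: List[List[float]],
                  P: List[List[List[float]]],
                  pi: List[List[List[float]]],
                  H: int) -> List[List[float]]:
    """Return values[h][s] for h = 0..H, where values[H] is zero (value after the final step)."""
    n_states, n_actions = len(R), len(R[0])
    values = [[0.0] * n_states for _ in range(H + 1)]
    for h in reversed(range(H)):  # backward over steps
        for s in range(n_states):
            v = 0.0
            for a in range(n_actions):
                # expected next-step value after taking action a in state s
                cont = sum(P[s][a][t] * values[h + 1][t] for t in range(n_states))
                v += pi[h][s][a] * (R[s][a] + cont)
            values[h][s] = v
    return values


# Toy instance (assumed for illustration): 2 states, 2 actions, horizon 2.
R = [[0.0, 1.0], [0.5, 0.0]]                  # R[s][a]: expected reward
P = [[[1.0, 0.0], [0.0, 1.0]],                # P[s][a][s']: transition probabilities
     [[0.0, 1.0], [1.0, 0.0]]]
uniform = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(2)]  # pi[h][s][a]
values = policy_values(R, P, uniform, H=2)
print(values[0])
```

The value of the policy is then the average of `values[0]` under the initial-state distribution.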
Data Collection Process: In this paper, we focus on the offline setting [Levine et al., 2020] with training data collected under a behavioral policy . Apart from the policy , the learner only has access to a dataset consisting of trajectories sampled from the distribution , where is the data sampling distribution induced by . That is, , where . Since the transitions in these trajectories are induced by the behavioral policy, for notational convenience, we let denote the transition kernel under the policy . That is, for any , .
2.1 Estimand of Interest
We now turn our attention to defining our target estimand, which refers to the specific quantity we aim to estimate. Consider a fixed policy and suppose we would like to estimate its value. Since estimating the value of the behavioral policy is easy (empirical average of total observed reward in each trajectory), we argue that it is sufficient to estimate – the difference in values between the evaluation and behavioral policies. This difference can be further decomposed. For each step , let be the policy that follows the behavioral policy up to step and then follows the evaluation policy. In (3), we decompose the difference in policy value between the evaluation and behavioral policy into the sum of differences in policy value between and for each step .
(3)
Here (i) follows from telescoping and (ii) follows from the fact that the policies and agree with the behavioral policy for the first steps. We let , the term corresponding to step in the above decomposition, be our estimand of interest. That is, our estimand () is the difference in value of policies and – these policies only differ in the current step , which may cause a difference in immediate rewards and may also cause a difference in next-state distributions (affecting future rewards even if the policies at future steps are the same).
(4)
We now seek to justify as an important estimand, and start by arguing that it is a reasonable estimand to care about. Note that, given the decomposition in (3), estimating and constructing CIs for allows us to estimate and construct CIs for (the difference in policy value between evaluation and behavioral policies) – and thus allows us to estimate and construct CIs for (evaluation policy value).
Beyond being an effective surrogate for policy evaluation, is an important quantity to consider in dynamic programming (DP) algorithms. DP algorithms construct the policy for the final step () and iteratively construct policies for earlier steps. At step , the policy at steps to are already fixed/computed. Hence at this step, one can interpret DP algorithms as attempting to select in order to maximize – that is, maximize the treatment effect of deviating from the behavioral policy at step h after having optimized the policy for all future steps. Hence, for any step , is a helpful estimand to consider for decision-making at step .
Importantly for us, when actions at step do not affect next state distributions, the problem of choosing a policy at step can be viewed as a CB problem. Helpfully in this case, unlike policy value, only depends on immediate rewards and can be estimated via offline stochastic CB techniques. However, when actions at step do influence next state distributions, RL techniques are necessary for estimating . Hence, beyond being a critical quantity for decision-making at step , it is also a quantity that is amenable to interpolating between CB and RL techniques. Thus, our paper focuses on estimating and constructing tight confidence intervals (CIs) for this estimand ().
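The telescoping decomposition in (3) can be checked numerically on a toy tabular MDP. The sketch below is an illustration under stated assumptions: steps are 0-indexed, the hybrid policy for step h follows the behavioral policy before step h and the evaluation policy from step h onward, and the helper names `value` and `hybrid` are hypothetical.

```python
from typing import List

Policy = List[List[List[float]]]  # pi[h][s][a]


def value(R, P, mu0, pi: Policy) -> float:
    """Expected total reward of pi when the initial state is drawn from mu0."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    for h in reversed(range(len(pi))):
        V = [sum(pi[h][s][a] * (R[s][a] + sum(P[s][a][t] * V[t] for t in range(n_states)))
                 for a in range(n_actions))
             for s in range(n_states)]
    return sum(mu0[s] * V[s] for s in range(n_states))


def hybrid(pi_b: Policy, pi_e: Policy, h: int) -> Policy:
    """Behavioral policy for steps before h, evaluation policy from step h on."""
    return pi_b[:h] + pi_e[h:]


R = [[0.0, 1.0], [0.5, 0.0]]
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
mu0 = [0.5, 0.5]
H = 3
pi_b = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(H)]   # behavioral policy
pi_e = [[[0.0, 1.0], [1.0, 0.0]] for _ in range(H)]   # evaluation policy

# Per-step estimand: the two hybrid policies differ only at step h.
thetas = [value(R, P, mu0, hybrid(pi_b, pi_e, h)) - value(R, P, mu0, hybrid(pi_b, pi_e, h + 1))
          for h in range(H)]
gap = value(R, P, mu0, pi_e) - value(R, P, mu0, pi_b)
print(thetas, sum(thetas), gap)
```

Summing the per-step terms recovers the gap between the evaluation and behavioral policy values, which is why CIs for each step's estimand combine into a CI for the overall difference.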
3 Shift Model
Offline RL is more challenging than offline policy learning in the stochastic CB setting [Foster et al., 2021]. The primary reason for the difference between the two settings is the state-distribution shift induced by deviating from the behavioral policy. Distribution shift makes any statistical learning theory problem challenging [Vergara et al., 2012, Bobu et al., 2018, Farshchian et al., 2018]. Hence methods that adapt to instance hardness must rely on some implicit or explicit approach to measure this state-distribution shift. To this end, we model the “heterogeneous treatment effect” [Künzel et al., 2019, Nie and Wager, 2021] of actions on the next-state distribution and refer to this effect as the “shift model”. More precisely, we define the shift model in (5).
(5)
Here captures the shift in the probability of transitioning from to due to selecting action at state instead of following the behavioral policy. With some abuse of notation, for any , we let . That is, is the expected shift (w.r.t. ) in probability of transitioning from to at step under the policy .
It is worth noting that shifts are bounded. For all states, actions, and steps, since the shift is a difference of two next-state distributions, its L1 norm is at most 2 by the triangle inequality.
We argue that shift helps capture instance hardness for estimating . To see this, we provide a shift-dependent expression for .
(6)
Here (i) follows from (4) (definition of ); and (ii) follows from (2) and (5). Note that, in the final expression of (6), the first term can be estimated using stochastic CB techniques and the dependence on next-step value function is scaled by the size of this shift. This hints at the possibility of developing methods that interpolate between CB and RL techniques. More formally, in Section 4, we show that shift estimates enable us to adapt to the hardness of our setting – when estimating and constructing CIs for .
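As a rough illustration of the shift model, the sketch below computes, for a single step of a tabular MDP, the difference between the next-state distribution after a given action and the next-state distribution under the behavioral policy. The function names and the two toy transition tables are assumptions made for this example only.

```python
from typing import List


def shift_model(P: List[List[List[float]]],
                pi_b_h: List[List[float]]) -> List[List[List[float]]]:
    """shift[s][a][t] = P(t | s, a) - sum over a' of pi_b(a' | s) * P(t | s, a')."""
    n_states, n_actions = len(P), len(P[0])
    shift = []
    for s in range(n_states):
        # next-state distribution at state s under the behavioral policy
        p_b = [sum(pi_b_h[s][a] * P[s][a][t] for a in range(n_actions))
               for t in range(n_states)]
        shift.append([[P[s][a][t] - p_b[t] for t in range(n_states)]
                      for a in range(n_actions)])
    return shift


def l1_norms(shift: List[List[List[float]]]) -> List[List[float]]:
    """L1 norm of the shift vector for every state-action pair."""
    return [[sum(abs(x) for x in row) for row in per_state] for per_state in shift]


pi_b_h = [[0.5, 0.5], [0.5, 0.5]]
# Dynamic instance: the action determines the next state.
dynamic_P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
# Bandit-like instance: the next-state distribution ignores the action.
bandit_P = [[[0.3, 0.7], [0.3, 0.7]], [[0.6, 0.4], [0.6, 0.4]]]

dynamic_norms = l1_norms(shift_model(dynamic_P, pi_b_h))
bandit_norms = l1_norms(shift_model(bandit_P, pi_b_h))
print(dynamic_norms, bandit_norms)
```

On the bandit-like table every shift vanishes, so the next-step value function drops out of the shift-dependent expression and only the immediate-reward term remains.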
4 Theory: Selective Propagation
In Section 2, we motivated and defined our estimand (see (4)) – which is the treatment effect for deviating from the behavioral policy at step h after having already deviated from the behavioral policy for all future steps. We now present an approach to estimate and construct tight valid CIs for – with interval size adapting to instance hardness. Here harder instances have larger next-state distribution shifts when deviating from the behavioral policy. When shifts are smaller, we can rely more on statistically efficient CB methods. However, when shifts are larger (the instance is more dynamic), we unavoidably must rely more on RL methods that account for worst-case distribution shifts.
Our approach to estimate and construct tight valid CIs for requires several inputs. These inputs, described in Section 4.1, allow us to abstract away existing approaches to tackle well-studied estimation problems in CB and RL settings. In Section 4.2, we describe how to combine these existing tools to achieve guarantees that adapt to instance hardness.
4.1 Inputs
Our method interpolates between existing tools for CB and RL settings, by leveraging shift estimates. To simplify our analysis and generalize our results, we assume access to these estimates as inputs to our interpolation method. In particular, we take as input: (1) offline CB treatment effect estimate and corresponding CI, (2) optimistic and pessimistic offline RL value function estimates, and (3) shift estimates with average error bounds. As the quality of our inputs improves (potentially as better estimators get developed), the quality of our outputs will correspondingly improve.
We now formally describe these inputs – requiring all the associated high-probability bounds to hold simultaneously with probability at least . We start by describing the first input, which is based on CB methods.
Input 1 (CB estimates): This input provides an estimate and CI for (formally defined in (7)) – which is the average treatment effect on the immediate reward for deviating from the behavioral policy at step .
(7)
Since only depends on the immediate reward, well-established offline CB techniques [e.g., Dudik et al., 2014] can be used to estimate and construct CIs for the difference (in terms of immediate rewards) between these policies. We let be our input estimate and let be the input CI radius. That is, the confidence interval is given by (8).
(8)
When deviating from the behavioral policy at step has no impact on next-state distributions, the estimate and CI for can be used as the estimate and CI for . However, when there is an impact on next-state distributions, valid estimation and CI construction for requires us to propagate estimates and uncertainty from future steps to the current step. To enable this propagation, we take estimates for as our second input.
Input 2 (RL estimates): This input provides pessimistic, standard, and optimistic estimates for – denoted by , , and respectively – such that the ordering in (9) holds. (Footnote 2: Note that (9) can be enforced during input construction.) Recall that denotes a bound on the maximum value can take for any state . Further, with high probability, we require that (10) holds – that is, the true value function is bounded by the pessimistic and optimistic value function estimates.
(9)
(10)
There is a large and growing literature on value function estimation in RL, including optimistic and pessimistic value function estimation that are designed to satisfy (10) [e.g., Martin et al., 2017, Wang et al., 2019, Jin et al., 2021]. Thus, we can employ the most cutting-edge methods to construct these next-step value function estimates.
Input 2 gave us estimates for (next-step value), which we may need to propagate to the current step – when constructing an estimate and CI for . Since our goal is to interpolate between tight CB guarantees and always valid RL guarantees, unlike traditional RL algorithms, we want to be selective in propagating next-step estimates/uncertainties. Our final input, shift estimates, allows us to only propagate these estimates when required – enabling our adaptation to instance hardness.
Input 3 (Shift estimates): This input provides an estimate for (see (5)) and an associated error bound – denoted by and respectively – such that (11) holds (recall from Section 3 that satisfies the same bound). We also require that (12) holds with high probability.
(11)
(12)
Since the true shift model is a function of the true transition model (see Section 3), shift can be estimated by first estimating the transition model and then calculating the treatment effect (due to deviating from the behavioral policy) of transitioning between any pair of states [see Moerland et al., 2023, for a survey on model-based RL and transition model estimation]. (Footnote 3: As a treatment effect model, shift may also be estimated via heterogeneous treatment effect estimators [e.g., Nie and Wager, 2021, Künzel et al., 2019].)
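One simple way to realize the model-based route described above, in a finite state space, is a plug-in estimate from counts. The sketch below is an assumption-laden illustration (the function `estimate_shift` and the toy log are hypothetical): it estimates the per-action transition distribution and the pooled behavioral transition distribution from logged (state, action, next state) tuples at one step, and takes their difference.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def estimate_shift(transitions: List[Tuple[int, int, int]],
                   n_states: int) -> Dict[Tuple[int, int], List[float]]:
    """Tabular plug-in shift estimate for every state-action pair seen in the log."""
    sa_counts = defaultdict(Counter)  # (s, a) -> next-state counts
    s_counts = defaultdict(Counter)   # s -> next-state counts pooled over logged actions
    for s, a, t in transitions:
        sa_counts[(s, a)][t] += 1
        s_counts[s][t] += 1
    shift_hat = {}
    for (s, a), counts in sa_counts.items():
        n_sa = sum(counts.values())
        n_s = sum(s_counts[s].values())
        # empirical P(t | s, a) minus empirical P(t | s) under the behavioral policy
        shift_hat[(s, a)] = [counts[t] / n_sa - s_counts[s][t] / n_s
                             for t in range(n_states)]
    return shift_hat


# Toy log (assumed): in state 0 the action decides the next state; in state 1 it does not.
logged = [(0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
shift_hat = estimate_shift(logged, n_states=2)
print(shift_hat)
```

Because each estimate is a difference of two empirical distributions, its L1 norm is at most 2 by construction. State-action pairs that never appear in the log are left out of the dictionary here; how to treat them is a design choice this sketch does not settle.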
4.2 Combining Inputs
We now have all our required input estimates, and can state our main result (Theorem 4.1) on constructing an estimate and CI for – in a way that adapts to instance hardness.
Theorem 4.1.
Suppose we have: (1) CB inputs ; (2) RL inputs satisfying (9); and (3) shift inputs satisfying (11) – such that (8), (10), and (12) hold with probability at least .
Moreover, suppose we have a (holdout) dataset of trajectories – sampled from the distribution – that were not used for constructing input estimates. (Footnote 4: Utilizing a holdout set for estimating and constructing CIs for allows us to treat our input estimates as fixed quantities (independent of the randomness in the sampled holdout dataset).) Our estimate for is then denoted by and given by (13).
(13)
Now, for some fixed , with probability at least , the confidence interval in (14) holds.
Here is the difference between the optimistic and pessimistic estimates – it captures the uncertainty in the next-step value function estimates. That is, for all , .
One of the advantages of in-distribution supervised learning is that excess risk bounds only depend on the complexity of the hypothesis class (and the number of training samples), with no dependence on the size of the feature space [see Shalev-Shwartz and Ben-David, 2014]. As discussed in Section 1, the statistical challenges of RL stem from the fact that learning under (state) distribution shifts is hard. For example, without additional assumptions, optimistic/pessimistic value function estimation has an unavoidable polynomial dependency on state-space size [Foster and Rakhlin, 2023]. Our goal is to avoid/minimize such dependencies when possible. The key benefit of Theorem 4.1 is that both our estimate and our CI width are “selective” in propagating/utilizing the RL estimates from input 2 – allowing us to only suffer from the slower worst-case RL estimation rates on harder instances. To better understand this, let us dig deeper into the terms in our CI width ().
Note that and (see Inputs 1 and 3) bound errors averaged under the behavioral policy state-distribution – that is, they bound in-distribution average errors. Hence, with appropriate inputs, the first two terms in shrink quickly with no dependency on state-space size [Dudik et al., 2014, Shalev-Shwartz and Ben-David, 2014]. The third term in , which enjoys a rate, also shrinks quickly to zero and has no dependency on state-space size.
We now only need to discuss the fourth and final term in . Unlike the previous terms which bound in-distribution average errors, this term does depend on per-state (point-wise) errors (). The reason RL algorithms seek to bound per-state errors is because these guarantees do not depend on the state-distribution and are valid under any shift. We now illustrate how this robustness to state-distribution shift comes at a cost of larger error bounds, and argue that it is advantageous to scale these terms down with the estimated shift. First, as a sanity check, we show that this term is finite.
(16)
where (i) follows from Hölder’s inequality, and (ii) follows from (11). Now that we know this term is finite, we can argue that it shrinks to zero. Since captures the size of per-state errors for pessimistic/optimistic value function estimates, we can expect this term to converge to zero in the limit with infinite data. The size of must depend on how often states similar to were visited at step in the training data for the RL input. For simplicity, let us consider the scenario when all states are visited uniformly at step under the distribution . In such a scenario, the frequency at which states similar to were visited at step would depend on some measure of the size of the state space . This would imply that shrinks at a rate that depends on some measure of the size of the state space . That is, while these terms shrink, they shrink slowly. Hence per-state bounds, while independent of state-distribution, come at a cost of slower statistical rates. As shown in [Foster et al., 2021], such a dependence of confidence interval width on state-space size is unavoidable in the worst case.
The key message of Theorem 4.1 is that we can move beyond this worst-case scenario by scaling these point-wise errors with the estimated shifts . For example, when , the fourth term in is zero, allowing us to recover contextual bandit-style guarantees that are independent of state-space size. It is worth noting that, even when state-space size is not a concern, being selective about error/estimate propagation can improve the resulting interval widths.
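The effect of scaling point-wise errors by the estimated shift can be illustrated with a small numeric sketch. The exact form and constants of the final term in the interval width are those of the theorem and are not reproduced here; the code only assumes that the term behaves like an average, over holdout states, of next-step interval widths weighted by the absolute estimated shift. All names and numbers below are hypothetical.

```python
from typing import List


def propagated_width(shift_hat_pi: List[List[float]],
                     width_next: List[float],
                     holdout_states: List[int]) -> float:
    """Average over holdout states s of the sum over t of |shift_hat_pi[s][t]| * width_next[t]."""
    total = 0.0
    for s in holdout_states:
        total += sum(abs(d) * w for d, w in zip(shift_hat_pi[s], width_next))
    return total / len(holdout_states)


# Gap between optimistic and pessimistic next-step value estimates, per next state.
width_next = [0.8, 0.4]
holdout_states = [0, 0, 1, 0]

no_shift = [[0.0, 0.0], [0.0, 0.0]]        # CB-like: evaluation policy causes no estimated shift
some_shift = [[0.5, -0.5], [-0.25, 0.25]]  # dynamic: estimated shift under the evaluation policy

cb_like = propagated_width(no_shift, width_next, holdout_states)
dynamic = propagated_width(some_shift, width_next, holdout_states)

# Hoelder-style sanity bound: L1 norm of a shift is at most 2, so the term is at most 2 * max width.
holder_bound = 2.0 * max(width_next)
print(cb_like, dynamic, holder_bound)
```

With zero estimated shift nothing is propagated from the next step, while the dynamic case propagates a fraction of the next-step width that stays below the worst-case bound.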
5 Selectively Pessimistic Value Iteration
Algorithm 1 Selectively Pessimistic Value Iteration
Inputs: Estimated reward model , estimated transition model , induced shift model , and estimation uncertainty bonuses .
Initialize .
for each step h, from the final step down to the first, do
    Estimate Q value. For all ,
    (17)
    Selective pessimism. For all ,
    (18)
    Estimate pessimistic value. For all ,
    (19)
    Estimate optimistic value. For all ,
    (20)
    Estimate value. For all ,
    (21)
end for
To demonstrate the benefits of our insights from Section 4, we now present an algorithm for offline policy learning. At a high level, we modify pessimistic value iteration (PVI) [Jin et al., 2021], a dynamic programming (DP) algorithm that chooses the policy at each step by maximizing pessimistic value function estimates. Pessimistic value maximization helps PVI avoid model exploitation – that is, PVI avoids picking policies with inaccurately high estimated values at step h by penalizing value function estimate uncertainty. We argue that, depending on the instance, such penalties due to estimate uncertainty may be larger than necessary for selecting the policy at step h.
In particular, based on the results in Section 4, uncertainty from later steps does not always need to be fully propagated to estimate lower bounds for the effect of deviating from the behavioral policy at step h (after fixing the policy for all future steps). Since we can view maximizing this treatment effect as the goal of DP algorithms at any step h, maximizing the tighter lower bounds from Section 4 should allow us to do better on easier CB-like instances (while always avoiding model exploitation) – motivating the design of Algorithm 1.
PVI often relies on a point-wise estimation uncertainty bonus to account for reward and transition estimation errors. (Footnote 5: In finite-state MDPs, these bonuses can be count based, where and is the number of times state and action are observed at step in the data set . Here is an algorithmic parameter. Several papers have extended these ideas to continuous state spaces [Bellemare et al., 2016, Osband et al., 2021].) For any , bounds the total model estimation errors at . At any , PVI uses bonuses to account for estimation errors at step and propagate estimation errors from future steps (by propagating pessimistic values from step that were computed using these bonuses). We also similarly rely on these bonuses to account for model estimation errors.
Our algorithm (Algorithm 1), selectively pessimistic value iteration (SPVI), takes as input: an estimated reward model , an estimated transition model , an induced shift model such that for all , and a point-wise estimation uncertainty bonus . No additional data is required. The algorithm uses these inputs to construct a policy iteratively. That is, SPVI constructs while iterating over steps to .
At step , SPVI computes: (1) the Q-value estimate for step (which is well defined since is fixed) denoted by (see (17)), (2) the policy at step denoted by (see (18)), and (3) pessimistic (), optimistic (), and standard () value functions for step (see (19), (20), and (21)). Steps one and three are fairly standard, as several RL algorithms construct Q-values and pessimistic/standard/optimistic value functions. Hence, for brevity, we only explain step 2 (constructing ) – the crux of our modification.
As we argued earlier, one can view the goal at step as selecting in order to maximize . We construct a tight lower bound for and argue that step 2 maximizes this bound. Similar to standard pessimistic value estimation, we let the bonus bound the total model estimation errors at . Now, from the analysis in Theorem 4.1, (22) is a valid lower bound on .
(22)
Here is a shorthand for ; and for any we let be a shorthand for . Now from (18), since for all , we have that step 2 (constructing ) is equivalent to maximizing the lower bound in (22). Importantly, by maximizing this tight lower bound we can do better on easier CB-like instances (while always avoiding model exploitation by penalizing uncertainty in estimating ). This completes our justification of Algorithm 1 (SPVI).
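To make the structure of the loop concrete, the following is one plausible tabular instantiation, not the paper's exact update rules: equations (17) to (22) are not reproduced in this text, so the selective penalty below (local bonus plus next-step interval width weighted by the absolute estimated shift), the single clipping constant `v_max`, and the step-independent model are all assumptions made for illustration.

```python
from typing import List, Tuple

Table = List[List[float]]


def spvi(r_hat: Table, P_hat: List[Table], shift_hat: List[Table], bonus: Table,
         H: int, v_max: float) -> Tuple[List[List[int]], Table, Table, Table]:
    """Return (policy[h][s], V_minus, V_hat, V_plus), each value table indexed [h][s]."""
    n_s, n_a = len(r_hat), len(r_hat[0])

    def clip(x: float) -> float:
        return max(0.0, min(v_max, x))

    V_minus = [[0.0] * n_s for _ in range(H + 1)]
    V_hat = [[0.0] * n_s for _ in range(H + 1)]
    V_plus = [[0.0] * n_s for _ in range(H + 1)]
    policy = [[0] * n_s for _ in range(H)]
    for h in reversed(range(H)):
        # uncertainty in next-step values: optimistic minus pessimistic estimate
        width = [V_plus[h + 1][t] - V_minus[h + 1][t] for t in range(n_s)]
        for s in range(n_s):
            def backup(a: int, V: List[float]) -> float:
                # one-step model backup of a next-step value estimate
                return r_hat[s][a] + sum(P_hat[s][a][t] * V[t] for t in range(n_s))

            def score(a: int) -> float:
                # ASSUMED selective-pessimism objective: Q estimate, minus the local bonus,
                # minus next-step uncertainty weighted by the absolute estimated shift.
                penalty = sum(abs(shift_hat[s][a][t]) * width[t] for t in range(n_s))
                return backup(a, V_hat[h + 1]) - bonus[s][a] - penalty

            a_star = max(range(n_a), key=score)
            policy[h][s] = a_star
            V_minus[h][s] = clip(backup(a_star, V_minus[h + 1]) - bonus[s][a_star])
            V_hat[h][s] = clip(backup(a_star, V_hat[h + 1]))
            V_plus[h][s] = clip(backup(a_star, V_plus[h + 1]) + bonus[s][a_star])
    return policy, V_minus, V_hat, V_plus


# Bandit-like toy inputs (assumed): transitions ignore the action, so the shift is zero.
r_hat = [[0.2, 0.9], [0.5, 0.1]]
P_hat = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
shift_hat = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
bonus = [[0.1, 0.1], [0.1, 0.1]]
policy, V_minus, V_hat, V_plus = spvi(r_hat, P_hat, shift_hat, bonus, H=3, v_max=3.0)
print(policy)
```

In this bandit-like case the shift-weighted penalty vanishes, so the choice at every step depends only on the immediate reward estimate and the local bonus, even though the pessimistic and optimistic value tables still drift apart at earlier steps.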
6 Simulation
To illustrate our insights, we consider a simple toy environment called “ChainBandit”. As described in Figure 1, this environment is constructed to have both dynamic (some actions result in a different next-state distribution) and bandit-like (non-dynamic) elements. This ensures that, while planning and some estimate/uncertainty propagation are necessary, complete uncertainty propagation can be unnecessary to evaluate policies of interest. Throughout this section, we consider ChainBandit with a chain length of 3 and consider the following behavioral policy () – at every state and step, selects action with probability and selects the other two actions with probability respectively. From the data collected, we evaluate standard and selective uncertainty propagation for the tasks of: (1) estimating upper/lower bounds for ; and (2) offline policy learning.
Since uncertainty propagation is the focus of this simulation, in order to make a fair comparison, both standard and selective propagation: use the same tabular approach to estimate a (step independent) reward/transition model; and use the same (step independent) count-based bonuses to account for model estimation errors. (Footnote 6: The model estimates and bonuses are not step dependent since the reward/transition functions in ChainBandit are the same for all steps. Further, a tabular approach to reward (transition) estimation simply indicates using the average reward (average one-hot next-state vector) observed at any state-action pair as its reward (transition) estimate.) Here bonus is given by – where is the number of times action was taken at state , and confidence parameter .
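The tabular estimates described above can be sketched as follows. The averaging of rewards and one-hot next-state vectors follows the description in the text; the bonus formula is not recoverable from this text, so the form `beta / sqrt(n)` used below is an assumption, as are the function and parameter names.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[int, int]  # (state, action)


def tabular_model(data: Iterable[Tuple[int, int, float, int]], n_states: int,
                  beta: float = 1.0
                  ) -> Tuple[Dict[Key, float], Dict[Key, List[float]], Dict[Key, float]]:
    """Step-independent tabular estimates from (state, action, reward, next_state) tuples."""
    n = defaultdict(int)
    reward_sum = defaultdict(float)
    next_counts = defaultdict(lambda: [0] * n_states)
    for s, a, r, t in data:
        n[(s, a)] += 1
        reward_sum[(s, a)] += r
        next_counts[(s, a)][t] += 1
    r_hat = {k: reward_sum[k] / n[k] for k in n}                # average observed reward
    P_hat = {k: [c / n[k] for c in next_counts[k]] for k in n}  # average one-hot next state
    bonus = {k: beta / math.sqrt(n[k]) for k in n}              # ASSUMED count-based bonus
    return r_hat, P_hat, bonus


# Toy log (assumed): four visits to state 0 with action 0.
data = [(0, 0, 1.0, 1), (0, 0, 0.0, 0), (0, 0, 1.0, 1), (0, 0, 0.0, 1)]
r_hat, P_hat, bonus = tabular_model(data, n_states=2, beta=1.0)
print(r_hat, P_hat, bonus)
```

Both standard and selective propagation would consume the same three outputs, which keeps the comparison focused on how uncertainty is propagated rather than on how the model is fit.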
For the step and policy of interest, standard CIs for are constructed using pessimistic/optimistic value estimates at step for policies and – i.e., utilizing (23).
(23)
For selective CIs, we use (22) to construct the lower bound and similarly construct the upper bound. Note that converting inequalities like (22) and (23) into empirical bounds is straightforward since our training is from . Figure 2 plots CIs for both selective and standard uncertainty propagation as we vary the evaluation policy. As expected, the benefits over standard pessimism are larger when the next-state distribution shift is smaller, that is, when the evaluation policy is closer to the behavioral policy.
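To make the standard construction concrete, here is a hedged sketch of one plausible reading: bound each policy's value from below and above by subtracting or adding the bonus during policy evaluation, then difference the bounds. The exact form of (23) is not recoverable from the text, so the function names, the use of step-independent policies, the assumption that rewards lie in [0, 1], and the clipping are all assumptions for illustration.

```python
import numpy as np

def bounded_policy_value(r_hat, p_hat, bonus, policy, horizon, sign):
    """Policy evaluation with the bonus added (sign=+1) or subtracted (sign=-1).

    r_hat, bonus: arrays of shape (n_states, n_actions).
    p_hat: array of shape (n_states, n_actions, n_states).
    policy: action probabilities of shape (n_states, n_actions), assumed
    step-independent. Rewards are assumed to lie in [0, 1].
    """
    v = np.zeros(r_hat.shape[0])
    for steps_left in range(1, horizon + 1):
        q = r_hat + sign * bonus + p_hat @ v   # propagate the estimate and the bonus
        q = np.clip(q, 0.0, steps_left)        # a value cannot exceed the steps left
        v = (policy * q).sum(axis=1)
    return v

def standard_effect_interval(r_hat, p_hat, bonus, pi_eval, pi_behavior, horizon):
    """Interval for the value difference between pi_eval and pi_behavior."""
    lo_eval = bounded_policy_value(r_hat, p_hat, bonus, pi_eval, horizon, -1)
    up_eval = bounded_policy_value(r_hat, p_hat, bonus, pi_eval, horizon, +1)
    lo_beh = bounded_policy_value(r_hat, p_hat, bonus, pi_behavior, horizon, -1)
    up_beh = bounded_policy_value(r_hat, p_hat, bonus, pi_behavior, horizon, +1)
    return lo_eval - up_beh, up_eval - lo_beh
```

In this sketch the uncertainty of both policies accumulates over every remaining step, which is the cost that selective propagation aims to avoid when the next-state distribution shift is small.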
In Figure 3, we plot the value of the policy learned by various algorithms as we vary the number of training episodes. In particular, we compare SPVI, PVI [Jin et al., 2021], and pessimistic supervised learning (PSL). Here PSL refers to pessimistic bandit policy optimization applied to each step without planning. The ChainBandit environment benefits from planning, so PSL performs poorly, as expected.
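The contrast between the two baselines can be sketched in tabular form. This is an illustrative simplification, not the code used in the paper: `pessimistic_value_iteration` follows the general recipe of Jin et al. [2021] (subtract a bonus, clip, act greedily, back up), `pessimistic_supervised_learning` is one assumed reading of PSL, and rewards are assumed to lie in [0, 1].

```python
import numpy as np

def pessimistic_value_iteration(r_hat, p_hat, bonus, horizon):
    """Tabular PVI sketch: plan backwards with bonus-penalized Q-values."""
    n_states, _ = r_hat.shape
    v_next = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for h in reversed(range(horizon)):
        q = r_hat - bonus + p_hat @ v_next     # pessimistic Bellman backup
        q = np.clip(q, 0.0, horizon - h)       # values are bounded by the steps left
        policy[h] = q.argmax(axis=1)
        v_next = q.max(axis=1)
    return policy, v_next

def pessimistic_supervised_learning(r_hat, bonus, horizon):
    """PSL sketch: greedy on penalized immediate reward, with no planning."""
    greedy = (r_hat - bonus).argmax(axis=1)
    return np.tile(greedy, (horizon, 1))
```

On an instance where a low-reward action leads to a better state, the first function picks that action while the second does not, which mirrors why PSL is expected to do poorly when planning matters.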
On all ChainBandit simulations we tried, SPVI was by far the best-performing algorithm. We chose the behavioral policy described earlier because it is disadvantageous for SPVI. In particular, since selective pessimism has an initial bias against policies that lead to significant shifts, we chose a highly sub-optimal behavioral policy (selecting with probability ). While this leads to a worse start for SPVI than PVI, SPVI eventually outperforms the other algorithms. We also run simulations for CI construction and policy learning on the standard GridWorld (see Appendix B); since this is a very dynamic environment, standard and selective propagation perform similarly there. (Footnote 7: Each algorithm run takes less than 2 minutes on a MacBook Pro M2 16GB.)
Conclusion: We introduce selective propagation, a general approach to interpolate between CB and RL techniques, achieving guarantees that adapt to instance hardness. Further developing this approach could impact real-world problems (e.g., recommendation systems, mHealth, EdTech) that lie between the two settings.
References
Abbasi-Yadkori et al. [2011]
Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari.
Improved algorithms for linear stochastic bandits.
In Advances in Neural Information Processing Systems 24, pages
2312–2320, 2011.
Agarwal et al. [2014]
Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert
Schapire.
Taming the monster: A fast and simple algorithm for contextual
bandits.
In Proceedings of the 31st International Conference on Machine
Learning, pages 1638–1646, 2014.
Agrawal and Goyal [2012]
Shipra Agrawal and Navin Goyal.
Analysis of Thompson sampling for the multi-armed bandit problem.
In Proceedings of the 25th Annual Conference on Learning
Theory, pages 39.1–39.26, 2012.
Agrawal and Goyal [2013]
Shipra Agrawal and Navin Goyal.
Thompson sampling for contextual bandits with linear payoffs.
In Proceedings of the 30th International Conference on Machine
Learning, pages 127–135, 2013.
Bellemare et al. [2016]
Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton,
and Remi Munos.
Unifying count-based exploration and intrinsic motivation.
Advances in Neural Information Processing Systems, 29, 2016.
Bobu et al. [2018]
Andreea Bobu, Eric Tzeng, Judy Hoffman, and Trevor Darrell.
Adapting to continuously shifting domains.
workshop, 2018.
Carranza et al. [2023]
Aldo Gael Carranza, Sanath Kumar Krishnamurthy, and Susan Athey.
Flexible and efficient contextual bandits with heterogeneous
treatment effect oracles.
In International Conference on Artificial Intelligence and
Statistics, pages 7190–7212. PMLR, 2023.
Chapelle and Li [2012]
Olivier Chapelle and Lihong Li.
An empirical evaluation of Thompson sampling.
In Advances in Neural Information Processing Systems 24, pages
2249–2257, 2012.
Dani et al. [2008]
Varsha Dani, Thomas Hayes, and Sham Kakade.
Stochastic linear optimization under bandit feedback.
In Proceedings of the 21st Annual Conference on Learning
Theory, pages 355–366, 2008.
Dudik et al. [2014]
Miroslav Dudik, Dumitru Erhan, John Langford, and Lihong Li.
Doubly robust policy evaluation and optimization.
Statistical Science, 29(4):485–511, 2014.
Farshchian et al. [2018]
Ali Farshchian, Juan A Gallego, Joseph P Cohen, Yoshua Bengio, Lee E Miller,
and Sara A Solla.
Adversarial domain adaptation for stable brain-machine interfaces.
arXiv preprint arXiv:1810.00045, 2018.
Foster and Rakhlin [2023]
Dylan J Foster and Alexander Rakhlin.
Foundations of reinforcement learning and interactive decision
making.
arXiv preprint arXiv:2312.16730, 2023.
Foster and Syrgkanis [2019]
Dylan J Foster and Vasilis Syrgkanis.
Orthogonal statistical learning.
arXiv preprint arXiv:1901.09036, 2019.
Foster et al. [2021]
Dylan J Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu.
Offline reinforcement learning: Fundamental barriers for value
function approximation.
arXiv preprint arXiv:2111.10919, 2021.
Garivier and Cappe [2011]
Aurelien Garivier and Olivier Cappe.
The KL-UCB algorithm for bounded stochastic bandits and beyond.
In Proceedings of the 24th Annual Conference on Learning
Theory, pages 359–376, 2011.
Jin et al. [2021]
Ying Jin, Zhuoran Yang, and Zhaoran Wang.
Is pessimism provably efficient for offline RL?
In International Conference on Machine Learning, pages
5084–5096. PMLR, 2021.
Krishnamurthy et al. [2018]
Akshay Krishnamurthy, Zhiwei Steven Wu, and Vasilis Syrgkanis.
Semiparametric contextual bandits.
In International Conference on Machine Learning, pages
2776–2785. PMLR, 2018.
Künzel et al. [2019]
Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu.
Metalearners for estimating heterogeneous treatment effects using
machine learning.
Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.
Langford and Zhang [2008]
John Langford and Tong Zhang.
The epoch-greedy algorithm for contextual multi-armed bandits.
In Advances in Neural Information Processing Systems 20, pages
817–824, 2008.
Lattimore and Szepesvari [2019]
Tor Lattimore and Csaba Szepesvari.
Bandit Algorithms.
Cambridge University Press, 2019.
Levine et al. [2020]
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu.
Offline reinforcement learning: Tutorial, review, and perspectives on open problems.
arXiv preprint arXiv:2005.01643, 2020.
Li et al. [2010]
Lihong Li, Wei Chu, John Langford, and Robert Schapire.
A contextual-bandit approach to personalized news article
recommendation.
In Proceedings of the 19th International Conference on World
Wide Web, 2010.
Martin et al. [2017]
Jarryd Martin, Suraj Narayanan Sasikumar, Tom Everitt, and Marcus Hutter.
Count-based exploration in feature space for reinforcement learning.
arXiv preprint arXiv:1706.08090, 2017.
Moerland et al. [2023]
Thomas M Moerland, Joost Broekens, Aske Plaat, Catholijn M Jonker, et al.
Model-based reinforcement learning: A survey.
Foundations and Trends® in Machine Learning,
16(1):1–118, 2023.
Nie and Wager [2021]
Xinkun Nie and Stefan Wager.
Quasi-oracle estimation of heterogeneous treatment effects.
Biometrika, 108(2):299–319, 2021.
Osband et al. [2021]
Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza
Ibrahimi, Xiuyuan Lu, and Benjamin Van Roy.
Epistemic neural networks.
arXiv preprint arXiv:2107.08924, 2021.
Russo et al. [2018]
Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen.
A tutorial on Thompson sampling.
Foundations and Trends in Machine Learning, 11(1):1–96, 2018.
Shalev-Shwartz and Ben-David [2014]
Shai Shalev-Shwartz and Shai Ben-David.
Understanding machine learning: From theory to algorithms.
Cambridge University Press, 2014.
Sutton [1988]
Richard Sutton.
Learning to predict by the methods of temporal differences.
Machine Learning, 3:9–44, 1988.
Sutton and Barto [1998]
Richard Sutton and Andrew Barto.
Reinforcement Learning: An Introduction.
MIT Press, Cambridge, MA, 1998.
Sutton et al. [2000]
Richard Sutton, David McAllester, Satinder Singh, and Yishay Mansour.
Policy gradient methods for reinforcement learning with function
approximation.
In Advances in Neural Information Processing Systems 12, pages
1057–1063, 2000.
Syrgkanis and Zhan [2023]
Vasilis Syrgkanis and Ruohan Zhan.
Post-episodic reinforcement learning inference.
arXiv preprint arXiv:2302.08854, 2023.
Thompson [1933]
William R. Thompson.
On the likelihood that one unknown probability exceeds another in
view of the evidence of two samples.
Biometrika, 25(3-4):285–294, 1933.
Vergara et al. [2012]
Alexander Vergara, Shankar Vembu, Tuba Ayhan, Margaret A Ryan, Margie L Homer,
and Ramón Huerta.
Chemical gas sensor drift compensation using classifier ensembles.
Sensors and Actuators B: Chemical, 166:320–329,
2012.
Wang et al. [2019]
Yining Wang, Ruosong Wang, Simon S Du, and Akshay Krishnamurthy.
Optimism in reinforcement learning with generalized linear function
approximation.
arXiv preprint arXiv:1912.04136, 2019.
Williams [1992]
Ronald Williams.
Simple statistical gradient-following algorithms for connectionist
reinforcement learning.
Machine Learning, 8(3-4):229–256, 1992.
Yin and Wang [2021]
Ming Yin and Yu-Xiang Wang.
Towards instance-optimal offline reinforcement learning with
pessimism.
Advances in Neural Information Processing Systems,
34:4065–4078, 2021.
Zanette and Brunskill [2019]
Andrea Zanette and Emma Brunskill.
Tighter problem-dependent regret bounds in reinforcement learning
without domain knowledge using value function bounds.
In International Conference on Machine Learning, pages
7304–7312. PMLR, 2019.
Zhang et al. [2022]
Kelly W Zhang, Omer Gottesman, and Finale Doshi-Velez.
A Bayesian approach to learning bandit structure in Markov decision
processes.
arXiv preprint arXiv:2208.00250, 2022.
Selective Uncertainty Propagation in Offline RL
(Supplementary Material)
In this section we prove our main theoretical result, Theorem 4.1. For convenience, we restate it below.
See Theorem 4.1.
We now begin the proof of Theorem 4.1. We start by proving Lemma A.1, which shows that is close to our target estimand. Here is defined in (24) and can be viewed as a less empirical version of (our treatment effect estimator at step ).
(24)
Lemma A.1.
Under the conditions of Theorem 4.1, the following bound holds with probability :
Here (i) follows from (2), (ii) follows from re-arranging terms, and (iii) follows from (7) and (5).
With probability , the input guarantees of Section 4 hold. Hence, utilizing these guarantees, we can bound the distance between and the target estimand .
(28)
Here (i) follows from (24) and (27), (ii) follows from the triangle inequality and the input guarantee, (iii) follows from the triangle inequality, and finally (iv) follows from the Cauchy-Schwarz inequality and the input guarantees.
∎
Having shown in Lemma A.1 that is close to our target estimand, we only need to show two more facts to complete the proof of Theorem 4.1: (i) that is close to our treatment effect estimator (defined in (13)), and (ii) that is sufficiently smaller than . We show that both statements hold with high probability.
We start by showing that is close to with high probability. In particular, (29) holds with probability at least .
(29)
In (29), the first equality follows from (13) and (24). The first inequality follows from Hoeffding’s inequality and the fact that .
From Hoeffding’s inequality and the fact that (16) holds, (30) also holds with probability at least .
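For reference, the textbook two-sided form of Hoeffding's inequality invoked in these steps is stated below. The symbols here ($X_i$, $a$, $b$, $n$, $\delta'$) are generic placeholders and not the paper's notation: for independent random variables $X_1, \dots, X_n$ taking values in $[a, b]$, the following holds with probability at least $1 - \delta'$.

```latex
\[
\left| \frac{1}{n}\sum_{i=1}^{n} X_i
      - \mathbb{E}\!\left[ \frac{1}{n}\sum_{i=1}^{n} X_i \right] \right|
\;\le\; (b - a)\,\sqrt{\frac{\log(2/\delta')}{2n}} .
\]
```

This follows from the tail bound $\Pr(|\bar{X} - \mathbb{E}\bar{X}| \ge t) \le 2\exp\!\left(-2nt^2/(b-a)^2\right)$ by setting the right-hand side equal to $\delta'$ and solving for $t$.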
Hence, with probability at least , (25), (29), and (30) all hold. We now use these equations to show that (31) holds.
(31)
Here (i) follows from the triangle inequality; (ii) follows from (25) and (29); and (iii) follows from (30). Therefore, we have shown that (31) holds with probability at least . This completes the proof of Theorem 4.1.
Appendix B Additional Simulation
Similar to Section 6, we now run simulations for the standard GridWorld environment (with width and height ). Here states are discrete points on a bounded two-dimensional grid. The agent’s starting state is sampled uniformly at random from the grid, and the agent should learn to reach a specified goal state (which is an absorbing state). Upon transitioning to the goal state, the agent receives a reward of one; otherwise it receives a reward of zero. The agent has 4 actions (left, right, up, and down), each of which moves the agent one step in that direction if possible. If the agent is at the boundary of the grid and cannot move in the selected direction, it stays in the same state. Since the goal state is absorbing, all actions taken there leave the agent in that state. We set the starting state for the GridWorld as (1,1), the terminal state as (2,2), and the horizon as 3. For all states/steps, our behavioral policy samples actions (left, right, up, and down) with the probabilities . Other choices for GridWorld environment parameters and behavioral policy appear to generate similar plots. All our plots in this section are averaged over five simulation runs.
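The GridWorld dynamics described above can be sketched as follows. Several details are assumptions because they are not recoverable from the text: the grid size passed in by the caller, zero-indexed coordinates, the action probabilities, and the choice to pay the reward only on first entering the goal. The names `grid_step` and `rollout` are hypothetical.

```python
import numpy as np

# Displacement of each action on the grid (coordinate convention is an assumption).
ACTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}

def grid_step(state, action, width, height, goal):
    """Return (next_state, reward) for one GridWorld transition."""
    if state == goal:
        return goal, 0.0  # absorbing goal; assumed to pay no further reward
    dx, dy = ACTIONS[action]
    x, y = state[0] + dx, state[1] + dy
    if not (0 <= x < width and 0 <= y < height):
        x, y = state  # blocked by the boundary: stay in the same state
    next_state = (x, y)
    return next_state, float(next_state == goal)

def rollout(policy_probs, start, goal, width, height, horizon, rng):
    """Sample one episode of (state, action, reward, next_state) tuples.

    policy_probs: probabilities over (left, right, up, down), shared by all
    states and steps as in the behavioral policy described above.
    """
    names = list(ACTIONS)
    state, episode = start, []
    for _ in range(horizon):
        action = names[rng.choice(len(names), p=policy_probs)]
        next_state, reward = grid_step(state, action, width, height, goal)
        episode.append((state, action, reward, next_state))
        state = next_state
    return episode
```

For example, `rollout([0.25, 0.25, 0.25, 0.25], (1, 1), (2, 2), 3, 3, 3, np.random.default_rng(0))` samples one horizon-3 episode on a hypothetical 3-by-3 grid under a uniform policy.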
We constructed intervals for , the treatment effect at step , using both selective and standard uncertainty propagation. In Figure 4, we plot the CIs for as we vary the evaluation policy, with parameterizing the evaluation policies. For all states/steps, the evaluation policy corresponding to samples actions (left, right, up, and down) with the probabilities . Our CIs are constructed from a sampled dataset of episodes. In Figure 4, we see that selective and standard uncertainty propagation perform similarly. This is understandable because GridWorld is a dynamic environment (each action has a different next-state distribution), so estimate/uncertainty propagation is less avoidable here for valid CI construction.
We also compare the policy learning algorithms on the same GridWorld environment, with the same behavioral policy. In Figure 5, we plot the value of the policy learned by SPVI, PVI, and PSL as we vary the number of training episodes. PSL still performs poorly since planning is necessary. However, since GridWorld is a dynamic environment, PVI and SPVI perform similarly.