Statistics > Machine Learning
[Submitted on 26 Oct 2023 (v1), last revised 10 Jun 2024 (this version, v2)]
Title: Demonstration-Regularized RL
Abstract: Incorporating expert demonstrations has empirically been shown to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces the sample complexity of RL. In particular, we study demonstration-regularized reinforcement learning, which leverages the expert demonstrations through KL-regularization toward a policy learned by behavior cloning. Our findings reveal that using $N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal policy at a sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes, where $\varepsilon$ is the target precision, $H$ the horizon, $A$ the number of actions, $S$ the number of states in the finite case, and $d$ the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behavior cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from prior work.
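To make the regularization scheme concrete, here is a minimal tabular sketch, not the authors' implementation: behavior cloning fits a reference policy $\pi^{\mathrm{BC}}$ from expert state-action counts, and planning then maximizes reward minus a KL penalty toward $\pi^{\mathrm{BC}}$. The MDP sizes, the Poisson stand-in for demonstration counts, and the regularization strength lam are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite MDP (sizes are illustrative, not from the paper).
S, A, H = 5, 3, 4
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
r = rng.uniform(size=(S, A))                 # rewards in [0, 1]

# Step 1: behavior cloning. Count-based with Laplace smoothing; the paper
# covers general policy classes, this is the simplest tabular instance.
demo_counts = rng.poisson(5.0, size=(S, A))  # stand-in for N^E expert state-action visits
pi_bc = (demo_counts + 1) / (demo_counts + 1).sum(axis=1, keepdims=True)

# Step 2: KL-regularized backward induction toward pi_bc. Per state, the inner
# problem max_pi <pi, Q> - lam * KL(pi || pi_bc) has the closed form
#   V(s) = lam * log sum_a pi_bc(a|s) * exp(Q(s, a) / lam),
# attained by pi(a|s) proportional to pi_bc(a|s) * exp(Q(s, a) / lam).
lam = 0.5                                    # regularization strength (illustrative)
V = np.zeros(S)
policy = np.zeros((H, S, A))
for h in reversed(range(H)):
    Q = r + P @ V                            # Q[s, a] = r[s, a] + E_{s'~P}[V(s')]
    logits = np.log(pi_bc) + Q / lam
    m = logits.max(axis=1, keepdims=True)    # stabilize the log-sum-exp
    w = np.exp(logits - m)
    policy[h] = w / w.sum(axis=1, keepdims=True)
    V = lam * (m[:, 0] + np.log(w.sum(axis=1)))

print("KL-regularized policy at step h=0:\n", policy[0])
```

As lam shrinks, the resulting policy approaches the unregularized greedy one; larger lam keeps it close to the behavior-cloned reference, which is the channel through which the expert demonstrations enter.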
Submission history
From: Daniil Tiapkin
[v1] Thu, 26 Oct 2023 10:54:47 UTC (183 KB)
[v2] Mon, 10 Jun 2024 11:46:34 UTC (180 KB)