
Finite-sample analysis of least-squares policy iteration

Published: 01 October 2012

Abstract

In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report a finite-sample analysis for this algorithm. To do so, we derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case where the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy-evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
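
The abstract describes LSPI as alternating two steps: evaluate the action-value function of the current policy with a least-squares temporal-difference solve (LSTD-Q), then improve the policy greedily with respect to that estimate. The Python sketch below shows that loop under illustrative assumptions: a linear feature map phi(state, action), samples given as (s, a, r, s_next) tuples, a single fixed batch of samples reused at every iteration, and a small ridge term reg to keep the LSTD matrix invertible. It is a minimal sketch of the algorithm, not the paper's exact formulation.

import numpy as np

def lstdq(samples, phi, n_actions, gamma, theta, reg=1e-6):
    # One policy-evaluation step: least-squares fixed point of the projected
    # Bellman equation for the policy that is greedy w.r.t. the current theta.
    d = phi(samples[0][0], samples[0][1]).shape[0]
    A = reg * np.eye(d)  # small ridge term (an assumption here) keeps A invertible
    b = np.zeros(d)
    for s, a, r, s_next in samples:
        # greedy action of the current policy at the next state
        a_next = max(range(n_actions), key=lambda u: phi(s_next, u) @ theta)
        f = phi(s, a)
        A += np.outer(f, f - gamma * phi(s_next, a_next))
        b += r * f
    return np.linalg.solve(A, b)

def lspi(samples, phi, n_actions, gamma, n_iterations=20, tol=1e-6):
    # Policy iteration: repeatedly re-evaluate the greedy policy with LSTD-Q
    # until the weight vector stops changing.
    d = phi(samples[0][0], samples[0][1]).shape[0]
    theta = np.zeros(d)
    for _ in range(n_iterations):
        theta_next = lstdq(samples, phi, n_actions, gamma, theta)
        if np.linalg.norm(theta_next - theta) < tol:
            return theta_next
        theta = theta_next
    return theta

In the abstract's terms, the finite-sample bounds control the error of each such lstdq solution at the states generated by the Markov chain, and the final result bounds how those per-iteration errors propagate through the lspi loop.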

Published In

The Journal of Machine Learning Research, Volume 13, Issue 1
January 2012
3712 pages
ISSN:1532-4435
EISSN:1533-7928

Publisher

JMLR.org

Publication History

Published: 01 October 2012
Published in JMLR Volume 13, Issue 1

Author Tags

  1. Markov decision processes
  2. finite-sample analysis
  3. generalization bounds
  4. least-squares policy iteration
  5. least-squares temporal-difference
  6. reinforcement learning

Qualifiers

  • Article
