DOI: 10.5555/3009657.3009806

Policy gradient methods for reinforcement learning with function approximation

Published: 29 November 1999

Abstract

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
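
For context, the gradient the abstract refers to is the paper's policy gradient theorem: ∂ρ/∂θ = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ Q^π(s,a), where ρ is the performance measure, d^π the state distribution induced by the policy π, and Q^π the action-value function. The sketch below is a minimal illustration of the simplest member of this family, Williams's REINFORCE [20], on a hypothetical two-state MDP; the transition and reward tables, step size, and episode length are invented for the example and are not taken from the paper.

    # Minimal REINFORCE sketch (Williams, 1992) on a made-up two-state MDP.
    # Everything here -- the transition table P, rewards R, step size, horizon --
    # is an illustrative assumption, not something specified in the paper.
    import numpy as np

    rng = np.random.default_rng(0)

    P = np.array([[0, 1],          # P[s, a] = next state (hypothetical MDP)
                  [1, 0]])
    R = np.array([[0.0, 1.0],      # R[s, a] = reward
                  [2.0, 0.0]])
    gamma, alpha = 0.9, 0.05
    n_states, n_actions = R.shape
    theta = np.zeros((n_states, n_actions))   # policy parameters

    def policy(s):
        """Softmax (Gibbs) action probabilities pi(s, .; theta)."""
        prefs = theta[s] - theta[s].max()
        p = np.exp(prefs)
        return p / p.sum()

    def grad_log_pi(s, a):
        """Gradient of log pi(s, a; theta) for the softmax parameterization."""
        g = np.zeros_like(theta)
        g[s] = -policy(s)
        g[s, a] += 1.0
        return g

    for episode in range(2000):
        s, traj = 0, []
        for t in range(20):                    # fixed-length episode
            a = rng.choice(n_actions, p=policy(s))
            traj.append((s, a, R[s, a]))
            s = P[s, a]
        # REINFORCE: accumulate grad log pi weighted by the return-to-go G_t.
        G, grad = 0.0, np.zeros_like(theta)
        for s, a, r in reversed(traj):
            G = r + gamma * G
            grad += G * grad_log_pi(s, a)
        theta += alpha * grad

    print("learned policy:", [np.round(policy(s), 2) for s in range(n_states)])

The learned policy should concentrate probability on the actions that reach and then exploit the higher-reward loop. Replacing the Monte Carlo return G with a compatible approximate action-value or advantage function is the refinement whose convergence the paper establishes.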

References

[1] Baird, L. C. (1993). Advantage Updating. Wright Lab. Technical Report WL-TR-93-1146.
[2] Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. Proc. of the Twelfth Int. Conf. on Machine Learning, pp. 30-37. Morgan Kaufmann.
[3] Baird, L. C., Moore, A. W. (1999). Gradient descent for general reinforcement learning. NIPS 11. MIT Press.
[4] Barto, A. G., Sutton, R. S., Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Trans. on Systems, Man, and Cybernetics 13:835.
[5] Baxter, J., Bartlett, P. (in prep.). Direct gradient-based reinforcement learning: I. Gradient estimation algorithms.
[6] Bertsekas, D. P., Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
[7] Cao, X.-R., Chen, H.-F. (1997). Perturbation realization, potentials, and sensitivity analysis of Markov processes. IEEE Trans. on Automatic Control 42(10):1382-1393.
[8] Dayan, P. (1991). Reinforcement comparison. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton (eds.), Connectionist Models: Proceedings of the 1990 Summer School, pp. 45-51. Morgan Kaufmann.
[9] Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proc. of the Twelfth Int. Conf. on Machine Learning, pp. 261-268. Morgan Kaufmann.
[10] Gordon, G. J. (1996). Chattering in SARSA(λ). CMU Learning Lab Technical Report.
[11] Jaakkola, T., Singh, S. P., Jordan, M. I. (1995). Reinforcement learning algorithms for partially observable Markov decision problems. NIPS 7, pp. 345-352. Morgan Kaufmann.
[12] Kimura, H., Kobayashi, S. (1998). An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions. Proc. ICML-98, pp. 278-286.
[13] Konda, V. R., Tsitsiklis, J. N. (in prep.). Actor-critic algorithms.
[14] Marbach, P., Tsitsiklis, J. N. (1998). Simulation-based optimization of Markov reward processes. Technical Report LIDS-P-2411, Massachusetts Institute of Technology.
[15] Singh, S. P., Jaakkola, T., Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision problems. Proc. ICML-94, pp. 284-292.
[16] Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. Ph.D. thesis, University of Massachusetts, Amherst.
[17] Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
[18] Tsitsiklis, J. N., Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning 22:59-94.
[19] Williams, R. J. (1988). Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, Northeastern University, College of Computer Science.
[20] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229-256.




          Published In

          NIPS'99: Proceedings of the 13th International Conference on Neural Information Processing Systems
          November 1999
          1070 pages

Publisher

MIT Press, Cambridge, MA, United States
