DOI: 10.5555/3009657.3009806

Policy gradient methods for reinforcement learning with function approximation

Published: 29 November 1999

Abstract

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
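
For context, the gradient the abstract refers to is the paper's policy gradient theorem: ∂ρ/∂θ = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ Q^π(s,a), where ρ is the performance measure, d^π the state distribution induced by the policy π, and Q^π the action-value function. The sketch below is a minimal illustration of the simplest member of this family, Williams's REINFORCE [20], on a hypothetical two-state MDP; the transition and reward tables, step size, and episode length are invented for the example and are not taken from the paper.

    # Minimal REINFORCE sketch (Williams, 1992) on a made-up two-state MDP.
    # Everything here -- the transition table P, rewards R, step size, horizon --
    # is an illustrative assumption, not something specified in the paper.
    import numpy as np

    rng = np.random.default_rng(0)

    P = np.array([[0, 1],          # P[s, a] = next state (hypothetical MDP)
                  [1, 0]])
    R = np.array([[0.0, 1.0],      # R[s, a] = reward
                  [2.0, 0.0]])
    gamma, alpha = 0.9, 0.05
    n_states, n_actions = R.shape
    theta = np.zeros((n_states, n_actions))   # policy parameters

    def policy(s):
        """Softmax (Gibbs) action probabilities pi(s, .; theta)."""
        prefs = theta[s] - theta[s].max()
        p = np.exp(prefs)
        return p / p.sum()

    def grad_log_pi(s, a):
        """Gradient of log pi(s, a; theta) for the softmax parameterization."""
        g = np.zeros_like(theta)
        g[s] = -policy(s)
        g[s, a] += 1.0
        return g

    for episode in range(2000):
        s, traj = 0, []
        for t in range(20):                    # fixed-length episode
            a = rng.choice(n_actions, p=policy(s))
            traj.append((s, a, R[s, a]))
            s = P[s, a]
        # REINFORCE: accumulate grad log pi weighted by the return-to-go G_t.
        G, grad = 0.0, np.zeros_like(theta)
        for s, a, r in reversed(traj):
            G = r + gamma * G
            grad += G * grad_log_pi(s, a)
        theta += alpha * grad

    print("learned policy:", [np.round(policy(s), 2) for s in range(n_states)])

The learned policy should concentrate probability on the actions that reach and then exploit the higher-reward loop. Replacing the Monte Carlo return G with a compatible approximate action-value or advantage function is the refinement whose convergence the paper establishes.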

References

[1] Baird, L. C. (1993). Advantage Updating. Wright Lab. Technical Report WL-TR-93-1146.
[2] Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. Proc. of the Twelfth Int. Conf. on Machine Learning, pp. 30-37. Morgan Kaufmann.
[3] Baird, L. C., Moore, A. W. (1999). Gradient descent for general reinforcement learning. NIPS 11. MIT Press.
[4] Barto, A. G., Sutton, R. S., Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Trans. on Systems, Man, and Cybernetics 13:835.
[5] Baxter, J., Bartlett, P. (in prep.). Direct gradient-based reinforcement learning: I. Gradient estimation algorithms.
[6] Bertsekas, D. P., Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
[7] Cao, X.-R., Chen, H.-F. (1997). Perturbation realization, potentials, and sensitivity analysis of Markov processes. IEEE Trans. on Automatic Control 42(10):1382-1393.
[8] Dayan, P. (1991). Reinforcement comparison. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton (eds.), Connectionist Models: Proceedings of the 1990 Summer School, pp. 45-51. Morgan Kaufmann.
[9] Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proc. of the Twelfth Int. Conf. on Machine Learning, pp. 261-268. Morgan Kaufmann.
[10] Gordon, G. J. (1996). Chattering in SARSA(λ). CMU Learning Lab Technical Report.
[11] Jaakkola, T., Singh, S. P., Jordan, M. I. (1995). Reinforcement learning algorithms for partially observable Markov decision problems. NIPS 7, pp. 345-352. Morgan Kaufmann.
[12] Kimura, H., Kobayashi, S. (1998). An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions. Proc. ICML-98, pp. 278-286.
[13] Konda, V. R., Tsitsiklis, J. N. (in prep.). Actor-critic algorithms.
[14] Marbach, P., Tsitsiklis, J. N. (1998). Simulation-based optimization of Markov reward processes. Technical Report LIDS-P-2411, Massachusetts Institute of Technology.
[15] Singh, S. P., Jaakkola, T., Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision problems. Proc. ICML-94, pp. 284-292.
[16] Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. Ph.D. thesis, University of Massachusetts, Amherst.
[17] Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
[18] Tsitsiklis, J. N., Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning 22:59-94.
[19] Williams, R. J. (1988). Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, Northeastern University, College of Computer Science.
[20] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229-256.




          Published In

          NIPS'99: Proceedings of the 13th International Conference on Neural Information Processing Systems
          November 1999
          1070 pages

Publisher

MIT Press, Cambridge, MA, United States
