DOI: 10.5555/3091125.3091208

Reward Shaping in Episodic Reinforcement Learning

Published: 08 May 2017

Abstract

Recent advances in reinforcement learning confirm that reinforcement learning techniques can solve large-scale problems and deliver high-quality autonomous decision making. It is only a matter of time before we see large-scale applications of reinforcement learning in sectors such as healthcare and cyber-security. However, reinforcement learning can be slow because the learning algorithms must determine the long-term consequences of their actions from delayed feedback or rewards. Reward shaping is a method of incorporating domain knowledge into reinforcement learning so that the algorithms are guided more quickly towards promising solutions. Under the overarching theme of episodic reinforcement learning, this paper presents a unifying analysis of potential-based reward shaping that yields new theoretical insights into reward shaping for both model-free and model-based algorithms, as well as for multi-agent reinforcement learning.
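
The abstract does not reproduce the shaping mechanism it analyses, so the following is a minimal sketch of standard potential-based reward shaping in tabular Q-learning: instead of the raw environment reward r, the agent learns from the shaped reward r + γΦ(s′) − Φ(s), where the potential Φ encodes domain knowledge about how promising a state is. The chain environment, the potential function phi, and all hyperparameters below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of potential-based reward shaping (PBRS) in tabular
# Q-learning. The shaped reward r + gamma*phi(s2) - phi(s) is the
# standard PBRS form; the 1-D chain task, the potential function, and
# the hyperparameters are hypothetical choices for illustration only.
import random

N_STATES, GOAL = 10, 9          # chain of states with the goal at the end
ACTIONS = (-1, +1)              # step left / step right
GAMMA, ALPHA, EPS = 0.99, 0.1, 0.2

def phi(s):
    # Domain knowledge: states nearer the goal get higher potential.
    # Note phi(GOAL) == 0, the usual convention for terminal states
    # in episodic tasks.
    return -abs(GOAL - s)

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    reward = 1.0 if s2 == GOAL else 0.0   # sparse, delayed reward
    return s2, reward, s2 == GOAL

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(300):
    s = 0
    for _ in range(200):                  # cap episode length
        a = random.choice(ACTIONS) if random.random() < EPS \
            else max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r, done = step(s, a)
        shaped = r + GAMMA * phi(s2) - phi(s)   # F(s, s') added to r
        target = shaped + (0.0 if done
                           else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2
        if done:
            break
```

Because consecutive shaping terms telescope along any trajectory, and the terminal potential here is zero (the usual convention in episodic settings), the shaping leaves the optimal policy unchanged while turning the sparse goal reward into a dense learning signal.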

Published In

AAMAS '17: Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems
May 2017
1914 pages

Sponsors

  • IFAAMAS

Publisher

International Foundation for Autonomous Agents and Multiagent Systems

Richland, SC

Author Tags

  1. multiagent learning
  2. potential-based reward shaping
  3. reinforcement learning
  4. reward shaping
  5. reward structures for learning

Qualifiers

  • Research-article

Acceptance Rates

AAMAS '17 paper acceptance rate: 127 of 457 submissions (28%)
Overall acceptance rate: 1,155 of 5,036 submissions (23%)
