
Part 4: Reinforcement learning: Machine learning and natural learning

Published: 01 September 2006

Abstract

The theory of reinforcement learning (RL) was originally motivated by animal learning of sequential behavior, and it has since been developed and extended in the field of machine learning as an approach to Markov decision processes. Recently, a number of neuroscience studies have suggested a relationship between reward-related activity in the brain and the functions required for RL. In this article, we review the history of RL, introduce its theory, and present two engineering applications. We then discuss how RL might be implemented in the brain.
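
To make the core mechanism concrete, the sketch below shows tabular temporal-difference (TD(0)) value learning on a toy chain task; the prediction-error term in the update is the quantity that reward-related neuroscience studies relate to dopamine neuron activity. The environment, parameter values, and function name are illustrative assumptions, not taken from the article.

    # A minimal sketch of tabular TD(0) value learning on a toy chain task.
    # The environment, parameters, and names are illustrative assumptions;
    # they are not taken from the article.

    def td0_chain_values(n_states=5, episodes=200, alpha=0.1, gamma=0.95):
        """Estimate state values V(s) with the TD(0) update
        V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
        The term in parentheses is the TD (reward-prediction) error."""
        V = [0.0] * n_states
        for _ in range(episodes):
            s = 0
            while s < n_states - 1:
                s_next = s + 1                              # chain: always step right
                r = 1.0 if s_next == n_states - 1 else 0.0  # reward only at the goal
                td_error = r + gamma * V[s_next] - V[s]     # dopamine-like prediction error
                V[s] += alpha * td_error
                s = s_next
        return V

    print(td0_chain_values())  # prints the learned state values for the chain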

Published In

New Generation Computing, Volume 24, Issue 3, September 2006, 146 pages

Publisher

Ohmsha

Japan

Author Tags

  1. Reinforcement Learning
  2. Temporal Difference
  3. Actor-critic
  4. Reward System
  5. Dopamine

Qualifiers

  • Research-article
