In artificial intelligence, apprenticeship learning (or learning from demonstration or imitation learning) is the process of learning by observing an expert. [1] [2] It can be viewed as a form of supervised learning, where the training dataset consists of task executions by a demonstration teacher. [2]
Mapping methods try to mimic the expert by forming a direct mapping either from states to actions, [2] or from states to reward values. [1] For example, in 2002 researchers used such an approach to teach an AIBO robot basic soccer skills. [2]
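A direct state-to-action mapping can be illustrated with a nearest-neighbour lookup over recorded demonstrations. This is only a minimal sketch: the state encoding (ball position relative to the robot), the action labels, and the function name `nearest_neighbor_policy` are hypothetical and are not taken from the cited work.

```python
import math


def nearest_neighbor_policy(demonstrations):
    """Build a policy that copies the expert action of the closest demonstrated state.

    demonstrations: list of (state, action) pairs, where state is a tuple of floats.
    """
    def policy(state):
        # Pick the demonstration whose state is closest in Euclidean distance.
        _, action = min(demonstrations, key=lambda pair: math.dist(pair[0], state))
        return action

    return policy


# Hypothetical demonstrations: (ball_x, ball_y) relative to the robot -> action label.
demos = [
    ((0.0, 1.0), "walk_forward"),
    ((1.0, 0.0), "turn_right"),
    ((-1.0, 0.0), "turn_left"),
    ((0.0, 0.1), "kick"),
]
policy = nearest_neighbor_policy(demos)
```

Calling `policy((0.1, 0.9))` returns the action recorded for the nearest demonstrated state. Practical systems usually replace the lookup with a trained classifier or regressor, but the supervised-learning structure is the same.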
Inverse reinforcement learning (IRL) is the process of deriving a reward function from observed behavior. While ordinary "reinforcement learning" involves using rewards and punishments to learn behavior, in IRL the direction is reversed, and a robot observes a person's behavior to figure out what goal that behavior seems to be trying to achieve. [3] The IRL problem can be defined as: [4]
Given 1) measurements of an agent's behaviour over time, in a variety of circumstances; 2) measurements of the sensory inputs to that agent; 3) a model of the physical environment (including the agent's body): Determine the reward function that the agent is optimizing.
IRL researcher Stuart J. Russell proposes that IRL might be used to observe humans and attempt to codify their complex "ethical values", in an effort to create "ethical robots" that might someday know "not to cook your cat" without needing to be explicitly told. [5] The scenario can be modeled as a "cooperative inverse reinforcement learning game", where a "person" player and a "robot" player cooperate to secure the person's implicit goals, despite these goals not being explicitly known by either the person or the robot. [6] [7]
In 2017, OpenAI and DeepMind applied deep learning to cooperative inverse reinforcement learning in simple domains such as Atari games and straightforward robot tasks such as backflips. The human role was limited to answering queries from the robot as to which of two different actions was preferred. The researchers found evidence that the techniques may be economically scalable to modern systems. [8] [9]
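One common way to turn pairwise answers of this kind into a reward signal is a Bradley-Terry style model, in which the probability of preferring one option grows with the difference in estimated reward. The sketch below assumes that model and fits one scalar reward per option by gradient ascent. The text above does not specify the model used, and the names `preference_probability` and `fit_rewards` are illustrative.

```python
import math


def preference_probability(reward_a, reward_b):
    """Bradley-Terry probability that option A is preferred over option B."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))


def fit_rewards(num_options, comparisons, learning_rate=0.1, epochs=500):
    """Fit one scalar reward per option by gradient ascent on the log-likelihood.

    comparisons: list of (preferred_index, rejected_index) pairs given by the human.
    """
    rewards = [0.0] * num_options
    for _ in range(epochs):
        for winner, loser in comparisons:
            p = preference_probability(rewards[winner], rewards[loser])
            # d/d r_winner of log p is (1 - p); the loser gets the opposite sign.
            rewards[winner] += learning_rate * (1.0 - p)
            rewards[loser] -= learning_rate * (1.0 - p)
    return rewards


# Hypothetical answers: option 0 beat 1, option 1 beat 2, option 0 beat 2.
learned = fit_rewards(3, [(0, 1), (1, 2), (0, 2)])
```

Only the differences between the learned rewards are meaningful here. With perfectly consistent answers the gaps keep growing with more epochs, so a real system would regularise the fit or parameterise the reward with a neural network.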
Apprenticeship via inverse reinforcement learning (AIRP) was developed in 2004 by Pieter Abbeel, professor in Berkeley's EECS department, and Andrew Ng, associate professor in Stanford University's Computer Science Department. AIRP deals with a "Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform". [1] AIRP has been used to model the reward functions of highly dynamic scenarios in which no reward function is intuitively obvious. Take driving, for example: many objectives operate simultaneously, such as maintaining a safe following distance, keeping a good speed, and not changing lanes too often. The task may seem easy at first glance, but a trivial reward function may not converge to the desired policy.
One domain where AIRP has been used extensively is helicopter control. While simple trajectories can be derived intuitively, the approach has also succeeded on complicated tasks such as aerobatics for shows. These include aerobatic maneuvers such as in-place flips, in-place rolls, loops, hurricanes and even auto-rotation landings. This work was developed by Pieter Abbeel, Adam Coates, and Andrew Ng in "Autonomous Helicopter Aerobatics through Apprenticeship Learning". [10]
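Abbeel and Ng's formulation assumes the unknown reward is linear in a set of state features and summarises the expert by its discounted feature expectations, which the learner then tries to match. The sketch below shows only the step of estimating those feature expectations from demonstrations. The driving features, thresholds, and discount factor are hypothetical choices for illustration.

```python
def feature_expectations(trajectories, features, discount=0.9):
    """Average discounted feature counts over demonstrated trajectories.

    trajectories: list of state sequences; features: function state -> list of floats.
    """
    dim = len(features(trajectories[0][0]))
    totals = [0.0] * dim
    for trajectory in trajectories:
        for t, state in enumerate(trajectory):
            for i, value in enumerate(features(state)):
                totals[i] += (discount ** t) * value
    return [total / len(trajectories) for total in totals]


def linear_reward(weights, state, features):
    """Reward assumed linear in the features: R(s) = w . phi(s)."""
    return sum(w * f for w, f in zip(weights, features(state)))


# Hypothetical driving state: (gap to car ahead in metres, speed in m/s, changed_lane flag).
def driving_features(state):
    gap, speed, changed_lane = state
    return [1.0 if gap >= 20.0 else 0.0, speed / 30.0, 1.0 if changed_lane else 0.0]


expert_demos = [[(25.0, 30.0, False), (10.0, 15.0, True)]]
mu_expert = feature_expectations(expert_demos, driving_features, discount=0.5)
```

A policy whose feature expectations are close to `mu_expert` earns nearly the same return as the expert under any bounded weight vector, which is why matching these vectors can stand in for knowing the reward function.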
System models try to mimic the expert by modeling world dynamics. [2]
The system learns rules to associate preconditions and postconditions with each action. In one 1994 demonstration, a humanoid learns a generalized plan from only two demonstrations of a repetitive ball collection task. [2]
Learning from demonstration is often explained from the perspective that a working robot control system is available and a human demonstrator is using it. Indeed, if the software works, the human operator takes the robot arm, makes a move with it, and the robot reproduces the action later. For example, the operator teaches the robot arm how to put a cup under a coffeemaker and press the start button. In the replay phase, the robot imitates this behavior one-to-one. But that is not how the system works internally; it is only what the audience can observe. In reality, learning from demonstration is much more complex. One of the first works on learning by robot apprentices (anthropomorphic robots learning by imitation) was Adrian Stoica's PhD thesis in 1995. [11]
In 1997, robotics expert Stefan Schaal was working on the Sarcos robot arm. The goal was simple: solve the pendulum swing-up task. The robot can execute a movement, and as a result the pendulum moves. The problem is that it is unclear which actions will result in which movement. This is an optimal control problem, which can be described with mathematical formulas but is hard to solve. Schaal's idea was not to use a brute-force solver but to record the movements of a human demonstration. The angle of the pendulum is logged over three seconds and plotted on the y-axis, which produces a diagram with a recognizable pattern. [12]
time (seconds) | angle (radians) |
---|---|
0 | -3.0 |
0.5 | -2.8 |
1.0 | -4.5 |
1.5 | -1.0 |
In computer animation, the principle is called spline animation. [13] Time is given on the x-axis, for example 0.5 seconds, 1.0 seconds, 1.5 seconds, while the y-axis gives the variable of interest. In most cases this is the position of an object; for the inverted pendulum it is the angle.
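The keyframe idea can be sketched by interpolating between the recorded samples in the table above. The example uses piecewise-linear interpolation, the simplest (degree-one) spline, as a stand-in; the source does not say which interpolation was used, and the function name `angle_at` is illustrative.

```python
from bisect import bisect_right

# (time in seconds, angle in radians), copied from the table above.
KEYFRAMES = [(0.0, -3.0), (0.5, -2.8), (1.0, -4.5), (1.5, -1.0)]


def angle_at(time, keyframes=KEYFRAMES):
    """Piecewise-linear interpolation of the recorded angle at an arbitrary time."""
    times = [t for t, _ in keyframes]
    # Clamp to the first or last keyframe outside the recorded interval.
    if time <= times[0]:
        return keyframes[0][1]
    if time >= times[-1]:
        return keyframes[-1][1]
    i = bisect_right(times, time)  # index of the first keyframe after `time`
    (t0, a0), (t1, a1) = keyframes[i - 1], keyframes[i]
    fraction = (time - t0) / (t1 - t0)
    return a0 + fraction * (a1 - a0)
```

For instance, `angle_at(0.25)` lies halfway between the first two samples. A cubic spline would give a smoother curve through the same keyframes at the cost of a little more bookkeeping.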
The overall task consists of two parts: recording the angle over time and reproducing the recorded motion. The reproducing step is surprisingly simple. As input, we know which angle the pendulum must have at each time step. Bringing the system to a desired state is called “tracking control”, for which a PID controller is a common choice. That means we have a trajectory over time and must find control actions that drive the system along this trajectory. Other authors call the principle “steering behavior”, [14] because the aim is to bring a robot to a given line.
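The reproduction step can be sketched as a PID loop that repeatedly compares the reference angle with the current angle and issues a correcting command. The plant here is a deliberately simplified assumption (the command directly sets the angular velocity), not the pendulum dynamics of the original experiment, and the gains are arbitrary illustrative values.

```python
class PID:
    """Textbook PID controller operating on a scalar error signal."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.previous_error = None

    def update(self, error, dt):
        self.integral += error * dt
        if self.previous_error is None:
            derivative = 0.0  # no history on the first call
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track(reference, start_angle, duration, dt=0.01, controller=None):
    """Follow reference(t) with a toy plant whose angular velocity equals the command."""
    controller = controller or PID(kp=4.0, ki=2.0, kd=0.1)
    angle = start_angle
    for step in range(int(round(duration / dt))):
        error = reference(step * dt) - angle
        command = controller.update(error, dt)
        angle += command * dt  # assumed plant: simple integrator, not a real pendulum
    return angle


# Drive the angle from -3.0 rad towards a constant reference of -1.0 rad.
final_angle = track(lambda t: -1.0, start_angle=-3.0, duration=10.0)
```

Passing a time-varying `reference`, such as an interpolation of the recorded keyframes, turns the same loop into a trajectory tracker. A real pendulum would need gains tuned to its dynamics.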
Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent ought to take actions in a dynamic environment in order to maximize the cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
Soar is a cognitive architecture, originally created by John Laird, Allen Newell, and Paul Rosenbloom at Carnegie Mellon University.
Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment, and it can handle problems with stochastic transitions and rewards without requiring adaptations.
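The model-free character of Q-learning shows in its update rule, which needs only an observed transition (state, action, reward, next state) and no transition model. Below is a minimal tabular sketch; the state and action names and the step-size and discount values are illustrative assumptions.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action).

    q: mapping from (state, action) to value; unseen pairs default to 0.0.
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q_table = defaultdict(float)
actions = ["left", "right"]
first = q_update(q_table, "s0", "right", 1.0, "s1", actions)
second = q_update(q_table, "s0", "right", 1.0, "s1", actions)
```

Each call moves the stored value a fraction `alpha` of the way towards the bootstrapped target, so repeated experience of the same transition converges on that target.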
Automated planning and scheduling, sometimes denoted as simply AI planning, is a branch of artificial intelligence that concerns the realization of strategies or action sequences, typically for execution by intelligent agents, autonomous robots and unmanned vehicles. Unlike classical control and classification problems, the solutions are complex and must be discovered and optimized in multidimensional space. Planning is also related to decision theory.
Perceptual control theory (PCT) is a model of behavior based on the properties of negative feedback control loops. A control loop maintains a sensed variable at or near a reference value by means of the effects of its outputs upon that variable, as mediated by physical properties of the environment. In engineering control theory, reference values are set by a user outside the system. An example is a thermostat. In a living organism, reference values for controlled perceptual variables are endogenously maintained. Biological homeostasis and reflexes are simple, low-level examples. The discovery of mathematical principles of control introduced a way to model a negative feedback loop closed through the environment, which spawned perceptual control theory. It differs fundamentally from some models in behavioral and cognitive psychology that model stimuli as causes of behavior. PCT research is published in experimental psychology, neuroscience, ethology, anthropology, linguistics, sociology, robotics, developmental psychology, organizational psychology and management, and a number of other fields. PCT has been applied to design and administration of educational systems, and has led to a psychotherapy called the method of levels.
In intelligence and artificial intelligence, an intelligent agent (IA) is an agent acting in an intelligent manner. It perceives its environment, takes actions autonomously in order to achieve goals, and may improve its performance with learning or acquiring knowledge. An intelligent agent may be simple or complex: A thermostat or other control system is considered an example of an intelligent agent, as is a human being, as is any system that meets the definition, such as a firm, a state, or a biome.
Robot learning is a research field at the intersection of machine learning and robotics. It studies techniques allowing a robot to acquire novel skills or adapt to its environment through learning algorithms. The embodiment of the robot, situated in a physical embedding, provides at the same time specific difficulties and opportunities for guiding the learning process.
In computer science, programming by demonstration (PbD) is an end-user development technique for teaching a computer or a robot new behaviors by demonstrating the task to transfer directly instead of programming it through machine commands.
Adaptable robotics refers to a field of robotics with a focus on creating robotic systems capable of adjusting their hardware and software components to perform a wide range of tasks while adapting to varying environments. The 1960s introduced robotics into the industrial field. Since then, the need for robots with new forms of actuation, adaptability, sensing and perception, and even the ability to learn has given rise to the field of adaptable robotics. Significant developments such as the PUMA robot, manipulation research, soft robotics, swarm robotics, AI, cobots, bio-inspired approaches, and other ongoing research have advanced the field considerably. Adaptable robots are usually associated with their development kit, typically used to create autonomous mobile robots. In some cases, an adaptable kit will still be functional even when certain components break.
Imitative learning is a type of social learning whereby new behaviors are acquired via imitation. Imitation aids in communication, social interaction, and the ability to modulate one's emotions to account for the emotions of others, and is "essential for healthy sensorimotor development and social functioning". The ability to match one's actions to those observed in others occurs in humans and animals; imitative learning plays an important role in humans in cultural development. Imitative learning is different from observational learning in that it requires a duplication of the behaviour exhibited by the model, whereas observational learning can occur when the learner observes an unwanted behaviour and its subsequent consequences and as a result learns to avoid that behaviour.
In the field of artificial intelligence (AI), AI alignment research aims to steer AI systems toward a person's or group's intended goals, preferences, and ethical principles. An AI system is considered aligned if it advances its intended objectives. A misaligned AI system may pursue some objectives, but not the intended ones.
Stefan Schaal is a German-American computer scientist specializing in robotics, machine learning, autonomous systems, and computational neuroscience.
Pieter Abbeel is a professor of electrical engineering and computer sciences, Director of the Berkeley Robot Learning Lab, and co-director of the Berkeley AI Research (BAIR) Lab at the University of California, Berkeley. He is also the co-founder of covariant.ai, a venture-funded start-up that aims to teach robots new, complex skills, and co-founder of Gradescope, an online grading system that has been implemented in over 500 universities nationwide. He is best known for his cutting-edge research in robotics and machine learning, particularly in deep reinforcement learning. In 2021, he joined AIX Ventures as an Investment Partner. AIX Ventures is a venture capital fund that invests in artificial intelligence startups.
Deep reinforcement learning is a subfield of machine learning that combines reinforcement learning (RL) and deep learning. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. Deep RL algorithms are able to take in very large inputs and decide what actions to perform to optimize an objective. Deep reinforcement learning has been used for a diverse set of applications including but not limited to robotics, video games, natural language processing, computer vision, education, transportation, finance and healthcare.
Multi-agent reinforcement learning (MARL) is a sub-field of reinforcement learning. It focuses on studying the behavior of multiple learning agents that coexist in a shared environment. Each agent is motivated by its own rewards, and does actions to advance its own interests; in some environments these interests are opposed to the interests of other agents, resulting in complex group dynamics.
The Center for Human-Compatible Artificial Intelligence (CHAI) is a research center at the University of California, Berkeley focusing on advanced artificial intelligence (AI) safety methods. The center was founded in 2016 by a group of academics led by Berkeley computer science professor and AI expert Stuart J. Russell. Russell is known for co-authoring the widely used AI textbook Artificial Intelligence: A Modern Approach.
Intrinsic motivation in the study of artificial intelligence and robotics is a mechanism for enabling artificial agents to exhibit inherently rewarding behaviours such as exploration and curiosity, grouped under the same term in the study of psychology. Psychologists consider intrinsic motivation in humans to be the drive to perform an activity for inherent satisfaction – just for the fun or challenge of it.
Proximal policy optimization (PPO) is an algorithm in the field of reinforcement learning that trains a computer agent's decision function to accomplish difficult tasks. PPO was developed by John Schulman in 2017 and became the default reinforcement learning algorithm at the American artificial intelligence company OpenAI. By 2018, PPO had achieved a wide variety of successes, such as controlling a robotic arm, beating professional players at Dota 2, and excelling in Atari games. Many experts called PPO the state of the art because it seems to strike a balance between performance and comprehension. Compared with other algorithms, the three main advantages of PPO are simplicity, stability, and sample efficiency.
In machine learning, reinforcement learning from human feedback (RLHF) is a technique to align an intelligent agent to human preferences. It involves training a reward model to represent human preferences, which can then be used to train other models through reinforcement learning.
Specification gaming or reward hacking occurs when an AI optimizes an objective function—achieving the literal, formal specification of an objective—without actually achieving an outcome that the programmers intended. DeepMind researchers have analogized it to the human behavior of finding a "shortcut" when being evaluated: "In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material—and thus exploit a loophole in the task specification."