3.1. Preliminary
In reinforcement learning, we consider an agent that interacts with an asymmetric environment over a number of discrete time steps, which in general follows a Markov Decision Process (MDP). The MDP is a tuple $(S, A, P, R, \rho_0)$, where $S$ is a set of states, $A$ is a set of actions, $R$ is a set of immediate rewards, $P(s' \mid s, a)$ is the transition probability (the probability of transitioning from the current state $s$ to the next state $s'$ by taking an action $a$), and $\rho_0$ is the robot's initial state distribution. A policy $\pi$ maps a state to the designated action, where $\pi(a \mid s)$ denotes the probability of choosing action $a$ in state $s$. The localization of the robot is taken from the simulator by default, and it is calculated by dead reckoning.
During the robot's learning process, we aim to choose a policy $\pi$ that maximizes the performance measure $J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$, which represents a finite-horizon discounted total return, where $\gamma \in (0, 1]$ is the discount factor and $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a roll-out of state-action-reward tuples obtained by following the policy $\pi$ (note that the distribution of states may change during the roll-out). The distribution of $\tau$ depends on the policy $\pi$, where $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
A value function $V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T-1} \gamma^t r_t \mid s_0 = s\right]$ denotes the accumulated discounted return of a roll-out whose actions are selected according to $\pi$. The advantage function is $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$.
The classical policy gradient method estimates the policy gradient
$$\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$
by sampling the environment with the current policy $\pi_\theta$ in an on-policy manner, and the parameters are updated by stochastic gradient ascent. The corresponding loss function is $L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$. However, in general, the on-policy method learns much more slowly than the off-policy method, since it has to learn episode by episode: it must sample and learn with the same policy $\pi_\theta$. Therefore, it is appealing to use an off-policy method, and PPO is currently one of the candidates:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],$$
where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$, and $r_t(\theta)\hat{A}_t$ is the conservative policy iteration objective, which may cause a large update step if left unconstrained. Therefore, the clipped term $\mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t$ is used to penalize a large gradient step when the ratio $r_t(\theta)$ is too large. The PPO method takes advantage of the policy gradient method in terms of its gradient estimator and constrains the ratio between the learned policy $\pi_\theta$ and the old policy $\pi_{\theta_{\mathrm{old}}}$ to the interval $[1-\epsilon,\ 1+\epsilon]$.
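For reference, a minimal sketch of the clipped surrogate loss is given below; the tensor names (logp, logp_old, adv) and the default clipping parameter are illustrative assumptions rather than the notation of the original implementation.

```python
import torch

def ppo_clip_loss(logp, logp_old, adv, eps=0.2):
    """Clipped PPO surrogate: E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)].

    logp, logp_old: log pi_theta(a|s) and log pi_theta_old(a|s) per sample.
    adv: advantage estimates A_t; eps: clipping parameter epsilon.
    """
    ratio = torch.exp(logp - logp_old)                 # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Negate because optimizers minimize, while the surrogate is maximized.
    return -torch.min(ratio * adv, clipped * adv).mean()
```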
3.2. Problem Setting
The aim of this paper is to provide a local planner that predicts the velocity of the robot given the observations and the local target:
$$v_t = \pi\!\left(s_t^{\mathrm{laser}},\, s_t^{\mathrm{goal}},\, v_{t-1}\right),$$
where $s_t^{\mathrm{laser}}$ is the observation from the laser scanner, $s_t^{\mathrm{goal}}$ is the local target calculated by the global planner, and $v_{t-1}$ is the velocity command from the last time step. The functionality of the elements of the methodology is shown in Figure 1. Here, $s_t^{\mathrm{goal}}$ is used to derive the relative local pose and the angle to the path, which are concatenated together. The final state representation is $s_t = \left[s_t^{\mathrm{laser}},\, s_t^{\mathrm{goal}},\, v_{t-1}\right]$.
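As a concrete illustration, the state vector can be assembled as below; the dimensions (25 downsampled laser beams, a 3-dimensional goal term, a 2-dimensional previous velocity command) anticipate the description in Section 3.3, and the function and array names are illustrative assumptions.

```python
import numpy as np

def build_state(laser_25, goal_dist, goal_angle, angle_to_path, v_prev):
    """Concatenate the three parts of the state: laser, local goal, last velocity.

    laser_25: 25 downsampled laser ranges.
    goal_dist, goal_angle: local target in polar form (robot frame).
    angle_to_path: heading error with respect to the global path.
    v_prev: (linear, angular) velocity command from the previous time step.
    """
    goal_term = np.array([goal_dist, goal_angle, angle_to_path])
    return np.concatenate([np.asarray(laser_25), goal_term, np.asarray(v_prev)])
```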
The objective function is formulated as
$$J = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T-1} \gamma^t r_t + \gamma^T r_T\right],$$
where the objective function $J$ combines the terminal reward $r_T$ and the immediate rewards $r_t$. Since the robot interacts with the environment using its own kinematic model, the kinematic constraints are embedded in the sampled roll-outs, and the learning process is model free.
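A small helper that evaluates this finite-horizon discounted objective for one sampled roll-out is sketched below; the reward sequence and the default discount value are placeholders.

```python
def discounted_return(rewards, gamma=0.99):
    """Finite-horizon discounted total return sum_t gamma^t * r_t for one roll-out."""
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret
```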
3.3. Algorithms
The method in this paper adapts the PPO method and modifies it into an asynchronous form to solve the navigation task. The robot learning approach is shown in Figure 2. The approach is to have the robot learn the strategy in an asynchronous manner. The robot interacts with scenes in an asymmetric environment. Each scene that the robot learns is divided into 4 parts, each representing a specific start and goal, and a dedicated worker is sent to collect a representation of the state in each patch. Once the robot learns the generalized strategy in one scenario (e.g., scenario 1 in the upper left corner), it copies a set of initialization parameters to another scenario (e.g., scenario 2 in the upper right) and explores the effect of the parameters learned in scenario 1 on the success rate of the robot in scenario 2. A global PPO is created to interact with the scenes in the asymmetric environment. Each scene is divided into sub-scenes to provide a training environment for each worker. The worker is initialized with one patch of the scene, and it collects roll-outs from the environment. The environment, on the other hand, runs as a server that generates observation packets waiting for a worker to collect them. Unlike a normal environment, which purely reports the locomotion of the robot and controls its movements, the environment used here is essentially a navigation framework. It also provides a valid path from a global planner. The global planner used in this paper adapts the A* algorithm and runs a trajectory smoother afterward. It serves as a plug-in in the complete environment. By taking advantage of the global planner, the policy learned in this paper is a local planner in the navigation task. Once learned, the proposed method can be easily adapted to the real environment and the navigation framework.
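The trajectory-smoothing step of the global planner can be sketched as follows; this assumes the A* search has already produced a list of (x, y) waypoints, and the iterative smoother and its weights are illustrative choices rather than the paper's implementation.

```python
def smooth_path(waypoints, weight_data=0.5, weight_smooth=0.3, tol=1e-4):
    """Iteratively smooth an A* waypoint list (list of (x, y) tuples).

    weight_data pulls the smoothed path toward the original waypoints,
    weight_smooth pulls neighboring points toward each other.
    Endpoints are kept fixed so the start and goal are preserved.
    """
    path = [list(p) for p in waypoints]
    new = [list(p) for p in waypoints]
    change = tol
    while change >= tol:
        change = 0.0
        for i in range(1, len(path) - 1):
            for d in range(2):
                old = new[i][d]
                new[i][d] += weight_data * (path[i][d] - new[i][d])
                new[i][d] += weight_smooth * (new[i - 1][d] + new[i + 1][d] - 2.0 * new[i][d])
                change += abs(old - new[i][d])
    return [tuple(p) for p in new]
```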
As mentioned in Section 3, the state observations have three components. The part of the state related to the laser scans originally contained 360 beams, and it was reduced to 25 beams by sampling at a fixed angular interval within a restricted frontal field of view. We did not use the full set of laser scans for two reasons: (i) the robot was not allowed to move backward; (ii) we did not want to deliberately increase the complexity of the learning task.
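A possible downsampling step is sketched below; the number of beams matches the 25 described above, while the field-of-view handling and the helper name are assumptions.

```python
import numpy as np

def downsample_scan(ranges_360, n_beams=25, fov_indices=None):
    """Reduce a 360-beam laser scan to n_beams evenly spaced readings.

    fov_indices optionally restricts the scan to a frontal field of view
    (the robot never moves backward); by default the full scan is used.
    """
    ranges = np.asarray(ranges_360, dtype=float)
    if fov_indices is not None:
        ranges = ranges[fov_indices]
    picks = np.linspace(0, len(ranges) - 1, n_beams).round().astype(int)
    return ranges[picks]
```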
The local goal $s_t^{\mathrm{goal}}$ was computed by tracing from the last node of the global path toward the start node and stopping at the edge of a pre-defined window. The concept is illustrated in the top left corner of Figure 2 (a zoomed-in view of one particular task). Once the local goal was determined, it was used as a local target that attracts the agent toward it. The first two dimensions of $s_t^{\mathrm{goal}}$ were calculated by transforming the target into the robot coordinate frame and expressing it in polar form (distance and angle). The third dimension was the angle to this path. $v_{t-1}$ was the linear and angular velocity from the last time step.
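The polar transform of the local target into the robot frame can be sketched as follows; the robot pose (x, y, yaw), the target position, and the path heading are assumed to be given in the world frame, and the names are illustrative.

```python
import math

def local_goal_polar(robot_x, robot_y, robot_yaw, gx, gy, path_heading):
    """Express the local target in the robot frame as (distance, angle)
    and compute the angle to the global path."""
    dx, dy = gx - robot_x, gy - robot_y
    dist = math.hypot(dx, dy)
    # Bearing of the target relative to the robot's heading, wrapped to [-pi, pi].
    angle = math.atan2(dy, dx) - robot_yaw
    angle = math.atan2(math.sin(angle), math.cos(angle))
    # Heading error with respect to the global path segment.
    angle_to_path = math.atan2(math.sin(path_heading - robot_yaw),
                               math.cos(path_heading - robot_yaw))
    return dist, angle, angle_to_path
```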
The design of the reward function is critical in a reinforcement learning method. Although a sparse rewarding scheme may also work, it is desirable to give the agent a prompt reward to guide the learning process. The reward functions are defined as follows:
If the robot has arrived at the target, as determined by a distance threshold check, a positive reward is given; if the robot has collided with an obstacle, as determined by a minimum range check, a negative reward is given. Either of these conditions stops the episode and restarts a new one. Otherwise, the reward is the distance difference between the last target position in the robot frame and the current target position in the robot frame, plus the angle difference between the last angle to the path and the current angle to the path. These terms encourage the robot to follow the global plan, but in a local manner.
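A sketch of this reward scheme is shown below; the thresholds and reward magnitudes (d_goal, d_coll, +/-10), as well as the exact combination of the two shaping terms, are placeholder assumptions rather than the values used in the paper.

```python
def compute_reward(dist_to_goal, min_laser_range, last_dist,
                   last_angle_to_path, angle_to_path,
                   d_goal=0.3, d_coll=0.2):
    """Terminal reward on arrival/collision, otherwise a progress-shaped reward.

    Returns (reward, done)."""
    if dist_to_goal < d_goal:        # arrived at the local target
        return 10.0, True
    if min_laser_range < d_coll:     # collided with an obstacle
        return -10.0, True
    # Progress toward the target plus improved alignment with the global path.
    r = (last_dist - dist_to_goal) + (abs(last_angle_to_path) - abs(angle_to_path))
    return r, False
```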
The detailed implementation of the methodology is summarized in Algorithm 1, which introduces the main structure of our asynchronous PPO.
Algorithm 1. Asynchronous PPO Pseudo-Code
Set number of workers N
Set minimum batch size
Initialize global buffer
for n = 1 to N do
    Initialize each worker using Worker() in Algorithm 2
end for
Train the network using the Train method in Algorithm 3 when notified
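A minimal threading sketch of Algorithm 1 is given below; the worker_fn and train_fn callables stand in for Algorithms 2 and 3, and the queue handling and event-based notification are implementation assumptions.

```python
import threading
import queue

def run_async_ppo(worker_fn, train_fn, n_workers=4, min_batch=64):
    """Algorithm 1: spawn N workers (Algorithm 2) and train when notified (Algorithm 3)."""
    rollout_queue = queue.Queue()      # global buffer shared by all workers
    update_event = threading.Event()   # workers set this when a batch is ready
    stop_event = threading.Event()

    # One worker thread per scene patch.
    for n in range(n_workers):
        threading.Thread(target=worker_fn,
                         args=(n, rollout_queue, update_event, stop_event),
                         daemon=True).start()

    # Train the global PPO network whenever a worker signals enough data.
    while not stop_event.is_set():
        update_event.wait()
        update_event.clear()
        train_fn(rollout_queue, min_batch)
```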
In Algorithm 2, we initialized each worker to collect the robot's velocity, reward, action, and state data according to the actual situation. In Algorithm 3, we combined the collected velocities, actions, and states and trained the critic network weights and the actor network weights. The main thread created N workers to collect roll-outs from different scenes. Each worker was initialized by using the method in Algorithm 2. Then, the training method was created by using Algorithm 3.
Algorithm 2. Asynchronous PPO Pseudo-Code for each Worker Thread
Worker():
Set global step counter
Set global PPO network
Set global buffer queue QUEUE
Initialize training environment ENV
Initialize local buffer
Restart a new game and get the initial state
repeat:
    Take an action by sampling the target distribution
    Receive the next state, reward, and done flag from ENV and append them to the local buffer
    if done or the local buffer reaches the batch size then:
        Estimate the value of the last state with the target critic (zero if done)
        for each reward in the buffer, iterating backward, do:
            Accumulate the discounted return using the critic bootstrap value
        end for
        Reverse the accumulated returns into time order
        Push the states, actions, and returns into QUEUE
        Reset the local buffer
    end if
    Notify the main thread to update the network
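A sketch of this worker loop is given below; the env, policy, and value_fn interfaces (a Gym-style step() and a target-critic value estimate) and the discount value are assumptions layered on top of Algorithm 2.

```python
def worker(env, policy, value_fn, rollout_queue, update_event,
           gamma=0.99, batch_size=32, max_episodes=1000):
    """Collect roll-outs, convert rewards to discounted returns, push to the queue."""
    for _ in range(max_episodes):
        s = env.reset()
        states, actions, rewards, done = [], [], [], False
        while not done:
            a = policy(s)                         # sample from the target distribution
            s_next, r, done, _ = env.step(a)
            states.append(s)
            actions.append(a)
            rewards.append(r)
            s = s_next
            if done or len(states) >= batch_size:
                # Bootstrap with the (target) critic value of the last state.
                R = 0.0 if done else value_fn(s_next)
                returns = []
                for r_t in reversed(rewards):     # backward accumulation
                    R = r_t + gamma * R
                    returns.append(R)
                returns.reverse()                 # restore time order
                rollout_queue.put((states, actions, returns))
                states, actions, rewards = [], [], []
                update_event.set()                # notify the main thread
```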
Algorithm 3. Pseudo-Code for the Train Method in Algorithm 1
Initialize critic network weights
Initialize actor network weights
Initialize target critic network weights
Initialize target actor network weights
Set learning rates for the actor and the critic
Set the decay rate for the target critic network
Train method():
Critic update:
    Regress the critic value toward the discounted return by minimizing the squared error,
    where the residual between the return and the critic value is the advantage
    Softly update the target critic using the decay rate
Actor update:
    Compute the probability ratio between the current policy and the old policy
    Maximize the clipped surrogate objective weighted by the advantage
    Update the actor weights by gradient ascent and synchronize the old policy
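A PyTorch-style sketch of the Train method is given below; the network classes, the soft-update rate tau, and the single-epoch update are assumptions layered on top of Algorithm 3.

```python
import torch

def train_step(actor, actor_old, critic, critic_target,
               actor_opt, critic_opt, states, actions, returns,
               eps=0.2, tau=0.01):
    """One critic update followed by one clipped-PPO actor update (Algorithm 3)."""
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.float32)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # Critic update: regress V(s) onto the discounted returns.
    values = critic(states).squeeze(-1)
    critic_loss = ((returns - values) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Soft update of the target critic used for bootstrapping in the workers.
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), critic_target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

    # Actor update: clipped surrogate with advantage A = R - V(s).
    with torch.no_grad():
        adv = returns - critic_target(states).squeeze(-1)
        logp_old = actor_old(states).log_prob(actions).sum(-1)
    logp = actor(states).log_prob(actions).sum(-1)
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    actor_loss = -torch.min(ratio * adv, clipped * adv).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Synchronize the old policy with the updated actor.
    actor_old.load_state_dict(actor.state_dict())
```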
In Algorithm 2, the worker was initialized with (i) the global PPO network, which was used to predict the value of a state (we use the target critic network to calculate this value to enhance the training stability); (ii) an environment that was carefully designed by the user; and (iii) a queue that was used to collect the sample roll-outs from each worker. In Algorithm 3, the actor and critic networks were first initialized with a uniform distribution. The detailed structure of the network is shown in
Figure 3.
According to Algorithm 3, the critic network is updated first, followed by the actor network, and this alternation is repeated until the algorithm converges. The critic network implemented 3 dense layers with 128 hidden units in each layer. The actor network used a similar structure but produced 4 output dimensions. These outputs were then plugged into a joint normal distribution to enable the robot to make decisions in a stochastic manner.
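A sketch of the described network structure in PyTorch is given below; the input size (30, i.e., 25 laser beams + 3 goal terms + 2 previous velocities), the activation functions, and the interpretation of the 4 actor outputs as the means and standard deviations of the linear and angular velocity are assumptions.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Three dense layers with 128 hidden units each, scalar value output."""
    def __init__(self, state_dim=30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, s):
        return self.net(s)

class Actor(nn.Module):
    """Same trunk, 4 outputs feeding a joint normal distribution over (v, w)."""
    def __init__(self, state_dim=30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 4))

    def forward(self, s):
        out = self.net(s)
        mean, log_std = out[..., :2], out[..., 2:]
        std = torch.exp(log_std.clamp(-5.0, 2.0))   # keep the std in a sane range
        return torch.distributions.Normal(mean, std)
```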