Article

Stochastic Potential Game-Based Target Tracking and Encirclement Approach for Multiple Unmanned Aerial Vehicles System

by Kejie Yang 1, Ming Zhu 2, Xiao Guo 2, Yifei Zhang 2,* and Yuting Zhou 3
1 School of Aeronautic Science and Engineering, Beihang University, Beijing 100191, China
2 Institute of Unmanned System, Beihang University, Beijing 100191, China
3 Beijing Aerospace Propulsion Institute, Beijing 100076, China
* Author to whom correspondence should be addressed.
Submission received: 3 January 2025 / Revised: 26 January 2025 / Accepted: 28 January 2025 / Published: 30 January 2025

Abstract:
Fully distributed intelligent control algorithms have enabled the gradual adoption of multiple unmanned aerial vehicle systems for Target Tracking and Encirclement missions in industrial and civil applications. Constrained by the evasive behavior of the target, current studies focus on zero-sum game formulations, and existing strategy solvers that accommodate continuous state-action spaces exhibit only modest performance. To tackle these challenges, we devise a Stochastic Potential Game framework to model the mission scenario while accounting for the environment's limited observability. Furthermore, a multi-agent reinforcement learning method is proposed to estimate the near-Nash Equilibrium strategy in this game scenario, utilizing time-serial relative kinematic information and obstacle observations. In addition, considering collision avoidance and cooperative tracking, several techniques, such as novel reward functions and recurrent network structures, are presented to optimize the training process. Numerical simulations demonstrate that the proposed method exhibits superior search capability for Nash strategies. Moreover, dynamic virtual experiments conducted with speed and attitude controllers show that well-trained actors can effectively act as practical navigators for real-time swarm control.

1. Introduction

The Multiple Unmanned Aerial Vehicles System (MUAVS) plays an important role in both industrial and civil applications thanks to the rapid evolution of intelligent distributed control algorithms. Deploying an MUAVS for disaster relief [1] or infrastructure inspection [2] in remote regions not only saves manpower and material resources but also improves the efficiency of mission completion. In these real-world applications, the system must possess both obstacle avoidance and Target Tracking and Encirclement (TTE) capabilities [3] for non-cooperative objects.
The TTE problem is a well-established topic in the field of aerial vehicle guidance and control, primarily focused on the estimation of the target state and data association [4]. To achieve this goal, classic measures such as the Kalman Filter [5,6,7,8] and Symmetric Measurement Equations [9,10,11] are widely deployed alongside practical automatic vehicle controllers. On the controller side, some studies [12,13,14,15] employ Model Predictive Control to achieve tracking. As multi-agent systems have developed over the past decades, approaches for navigating collaborative UAV swarms have been studied. Several methods derived from graph theory are presented in [16,17], showing ideal results in kinematic simulation. In [18,19,20], Actor–Critic (AC)-based MARL frameworks with different constraints are proposed to solve single- or multiple-target tracking, trajectory optimization, and collision avoidance problems, with their main innovations being unique reward functions. When the number of tracking members is insufficient, a novel Voronoi diagram combined with the Nash Q-learning method is introduced in [21] to produce UAV control inputs. Many of these solutions provide optimal results in simulation but do not account for the evasion tendencies of the target in real-world scenarios.
The field of game theory is experiencing rapid growth due to the rising interest in model-free multi-agent reinforcement learning (MARL) techniques [22,23,24,25]. Recent MARL studies focus on how agents can effectively utilize observation information under partially observable conditions to improve collective reward acquisition. In [26], a graph-based MARL algorithm is proposed that utilizes a directed graph between agents and a multi-head attention mechanism to guide Q-network iteration. Since then, the research branch based on directed graphs has gradually shown top performance in multi-agent applications. In [27], a novel information exchange mechanism between agents is employed to update the critic within the AC framework. That work considers the interdependencies among agents and achieves better performance in standard test scenarios. To personalize the peer-to-peer communication topology during task execution, a self-adapting graph formation is presented in [28]. This method does not restrict the communication topology between agents, which can be dynamically adjusted according to task execution conditions.
On the other hand, the advancement of modern computing technologies has led to significant progress in the numerical solution of general-sum games. A general-sum game is one where the sum of all players' payoffs is not necessarily zero, meaning players' interests can be both competitive and cooperative. After Nash Q-learning [29,30] set the precedent that RL methods could deal with game problems effectively, many model-free MARL methods as well as NE strategy solvers have been proposed in recent years. In [23], a general MARL algorithm called Policy-Space Response Oracles is introduced to obtain effective meta-strategies in partially observable environments, attaining the highest score in the Poker Game benchmark. The Nash Equilibrium (NE) [31] convergence of AC approaches is proven in [32] via Two-Timescale stochastic approximation, removing the barrier between continuous general-sum games and popular RL algorithms. Meanwhile, many previous studies have applied game theory to the TTE problem. The Hamilton–Jacobi equation for the Pursuer–Evader differential game based on the zero-sum formulation is discussed in [33], which provides a new perspective for solving a general class of target-hitting games. Furthermore, the bounded-rational level-k Hamilton–Jacobi–Isaacs equation in [34] is used to express the Pursuer–Evader game scenario for different levels of intelligence. While zero-sum games provide a straightforward description of the interaction between the trackers and the target, other crucial factors such as collision avoidance and controllability are not accounted for.
Among the many environment modeling methods is a game formulation known as the Potential Game (PG) [35]. Agents sharing a unified potential function in a PG act neither strictly as a team nor as opponents but pursue multiple goals simultaneously and in a distributed manner. Such characteristics are suitable for describing the aforementioned TTE scenario, in which an MUAVS needs to survive in a complex environment. In this setting, system members act as self-interested agents trying to maximize rewards that are affected by the relative kinematic state and the unknown environment in continuous action and state spaces. As a traditional problem in the PG field, searching for the NE solution is a common challenge across related research [22,32,36]. The NE of a game is a joint mixed strategy profile in which no agent can obtain a higher payoff by unilaterally modifying its own strategy. To address this Polynomial Parity Arguments on Directed graphs (PPAD)-hard problem [37], recent research has proposed the Stochastic Potential Game (SPG) concept, which combines the Markov Decision Process (MDP) with the PG to simplify the solution process [38,39] and adopts MARL approaches based on the Bellman operator as the solver. Although these methods readily achieve state-of-the-art performance in discrete multi-agent systems [40], their effectiveness in continuous action-state spaces, such as an MUAVS deployed in the real world, remains to be improved. Meanwhile, most research typically focuses on kinematic simulation without considering the dynamic properties of multi-agent systems; achieving intelligent auto-control in such systems remains an ongoing challenge.
Regarding the TTE problem, when targets possess intelligence comparable to that of the trackers, conventional target tracking algorithms face significant challenges in effectively coordinating UAV swarms for collaborative tracking missions. In real-world scenarios, it is difficult for UAVs to obtain the target's path or predict its trajectory in advance, and the target may employ evasive maneuvers or countermeasures. Under such circumstances, the UAV trackers must develop capabilities to address these complex situations effectively. Therefore, in this paper, we address the above-mentioned challenge for multi-agent systems operating in an unknown environment based on SPG theory. Selecting quadrotors as the system agents, we propose a MARL approach for approximating the NE of the continuous SPG that considers the kinematic and dynamic constraints of both the trackers and the target. The primary contributions of this paper are outlined below:
  • The SPG-based TTE scenario is constructed within a finite continuous action-state domain that involves adjusting the composite tasks of an MUAVS in various environments to achieve the TTE objective. All trackers are dealing with the uncertainty of the target’s motion pattern, as the target possesses equivalent intelligent decision-making capabilities within the scenario.
  • To search for local NE strategies, a Time-Series Multi-Agent Soft Actor–Critic (TMSAC) approach is proposed. It leverages sequential observations from agents and is particularly effective in determining the optimal strategy for MUAVS, especially when involving multiple agents. Furthermore, novel reward functions that account for the SPG condition are designed. These functions are integrated with the TMSAC framework to enhance performance for both the tracker and the target. The convergence of the algorithm is also discussed to justify its search capability for the local NE.
  • Considering the dynamical characteristics of agents, a guidance loop based on the actor training from TMSAC is combined with velocity and attitude controllers, deployed in the visualized physical simulation environment to show the effectiveness and the success rate of the proposed method.
The remainder of this article is organized as follows. Before analyzing environment properties, Section 2 presents the dynamic and relative kinematic models used in further research. After that, the construction of SPG and MARL method details are introduced in Section 3 and Section 4. Finally, multiple experiments and visualized simulation for performance tests are discussed in Section 5.

2. System Modeling

In this section, we analyze the relevant quadrotor dynamics and multi-agent kinematics before introducing the game environment and tracking formulation, which serve as the fundamental background for the subsequent research.

2.1. Quadrotor Dynamics

We select a small-size quadrotor to serve as both the trackers and the target in our research. As a significant factor in the tracking navigator design, the following assumptions must be noted before constructing the dynamic model.
Assumption 1. 
The UAV model can be regarded as a rigid body, and the small-disturbance approach is used to linearize the dynamic formulation around the equilibrium points.
Assumption 2. 
The mass distribution of the quadrotor is symmetrically arranged along all axes of the Body Reference Frame.
As in Figure 1, we present a universal quadrotor model that employs two coordinate systems, the Body Reference Frame (BRF) and the Earth Reference Frame (ERF), both of which conform to the right-hand rule. The BRF origin $O_b$ coincides with the aircraft barycenter, while the $O_bZ_b$ axis is perpendicular to the fuselage plane, with the positive direction pointing toward the ground. The $O_bX_b$ and $O_bY_b$ axes are perpendicular to each other in the fuselage plane. The ERF origin $O_g$ is fixed on the ground plane, with $O_gX_g$ pointing to geographic north and $O_gY_g$ to geographic east. The $O_gZ_g$ axis points toward the geocenter.
The velocity $\nu = [u, v, w]^T$ and angular velocity $\omega = [p, q, r]^T$ are defined in the BRF. Meanwhile, the coordinate vector $P = [x, y, z]^T$ and Euler angles $\Theta = [\phi, \theta, \psi]^T$ describe the quadrotor position and attitude in the ERF. Thus, the transformation between the BRF and the ERF can be described as Equation (1) [41]:
$$\begin{bmatrix} \dot{P} \\ \dot{\Theta} \end{bmatrix} = \begin{bmatrix} R_b^e & 0_{3\times3} \\ 0_{3\times3} & R_E \end{bmatrix} \begin{bmatrix} \nu \\ \omega \end{bmatrix},$$
where $R_b^e$ is the coordinate transformation matrix defined in Equation (2), and $0_{3\times3}$ is the $3 \times 3$ zero matrix. For concision, $\sin(\cdot)$ and $\cos(\cdot)$ are abbreviated as $s$ and $c$, respectively,
$$R_b^e = \begin{bmatrix} c\theta\, c\psi & s\phi\, s\theta\, c\psi - c\phi\, s\psi & c\phi\, s\theta\, c\psi + s\phi\, s\psi \\ c\theta\, s\psi & s\phi\, s\theta\, s\psi + c\phi\, c\psi & c\phi\, s\theta\, s\psi - s\phi\, c\psi \\ -s\theta & s\phi\, c\theta & c\phi\, c\theta \end{bmatrix}.$$
The R E is the Euler rotation matrix defined as
$$R_E = \begin{bmatrix} 1 & \sin\phi \tan\theta & \cos\phi \tan\theta \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi \sec\theta & \cos\phi \sec\theta \end{bmatrix}.$$
The mass of the quadrotor is denoted as $m$. The principal axes of inertia are completely aligned with the axes of the BRF. According to Assumption 2, the products of inertia satisfy $I_{xy} = I_{xz} = I_{yz} = 0$, and the moments of inertia are denoted $I_x$, $I_y$, $I_z$. Since the resultant force vector from the propellers $F = F_1 + F_2 + F_3 + F_4$ and the moment $M$ produced by $F$ are applied in the BRF, the dynamic model of the quadrotor is given by Equations (4) and (5) [42]:
$$\dot{\nu} = \begin{bmatrix} rv - qw \\ pw - ru \\ qu - pv \end{bmatrix} + \frac{1}{m}\begin{bmatrix} F_x \\ F_y \\ mg - F_z \end{bmatrix},$$
$$\dot{\omega} = \begin{bmatrix} \left[(I_y - I_z)qr + M_x\right]/I_x \\ \left[(I_z - I_x)pr + M_y\right]/I_y \\ \left[(I_x - I_y)pq + M_z\right]/I_z \end{bmatrix}.$$
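For readers who prefer code to notation, the following Python sketch evaluates the rigid-body model above. It is an illustrative implementation of Equations (1)-(5) only; the mass, inertia values, and function names are placeholders rather than parameters from Table 3.

```python
# Minimal sketch (not the authors' implementation) of the quadrotor model in Eqs. (1)-(5).
import numpy as np

def body_to_earth(phi, theta, psi):
    """Coordinate transformation matrix R_b^e of Equation (2) (Z-Y-X Euler convention)."""
    c, s = np.cos, np.sin
    return np.array([
        [c(theta)*c(psi), s(phi)*s(theta)*c(psi) - c(phi)*s(psi), c(phi)*s(theta)*c(psi) + s(phi)*s(psi)],
        [c(theta)*s(psi), s(phi)*s(theta)*s(psi) + c(phi)*c(psi), c(phi)*s(theta)*s(psi) - s(phi)*c(psi)],
        [-s(theta),       s(phi)*c(theta),                        c(phi)*c(theta)],
    ])

def euler_rate_matrix(phi, theta):
    """Euler rotation matrix R_E of Equation (3)."""
    c, s, t = np.cos, np.sin, np.tan
    return np.array([
        [1.0, s(phi)*t(theta),  c(phi)*t(theta)],
        [0.0, c(phi),          -s(phi)],
        [0.0, s(phi)/c(theta),  c(phi)/c(theta)],
    ])

def dynamics(nu, omega, Theta, F, M, m=1.2, I=(0.012, 0.012, 0.022), g=9.81):
    """Body-frame accelerations from Eqs. (4)-(5) and ERF rates from Eq. (1); m, I are placeholders."""
    u, v, w = nu
    p, q, r = omega
    Ix, Iy, Iz = I
    Fx, Fy, Fz = F
    nu_dot = np.array([r*v - q*w, p*w - r*u, q*u - p*v]) + np.array([Fx, Fy, m*g - Fz]) / m
    omega_dot = np.array([((Iy - Iz)*q*r + M[0]) / Ix,
                          ((Iz - Ix)*p*r + M[1]) / Iy,
                          ((Ix - Iy)*p*q + M[2]) / Iz])
    P_dot = body_to_earth(*Theta) @ np.asarray(nu)          # position rates in the ERF
    Theta_dot = euler_rate_matrix(Theta[0], Theta[1]) @ np.asarray(omega)  # Euler angle rates
    return nu_dot, omega_dot, P_dot, Theta_dot
```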

2.2. Multi-Agent Kinematics

Multi-agent kinematics describe the relative motion of the MUAVS members and the target and serve as the basis of the SPG design. Taking the dynamic properties of quadrotors into account, the following assumptions are proposed before defining the kinematics so that the game environment is modeled reasonably.
Assumption 3. 
In the multi-agent kinematics modeling, both MUAVS and the target are moving in the same two-dimensional plane, and no motion in the ERF Z-axis needs to be considered.
Assumption 4. 
All quadrotors can obtain the relative position information of other agents through the airborne sensors.
Assumption 5. 
By Assumption 3, the ground velocity $v_i$ synthesized from $\dot{x}$ and $\dot{y}$ of each tracker is restricted to $v_i \le v_{max}$. The yaw rate $\dot{\psi}_i$ satisfies $|\dot{\psi}_i| \le \dot{\psi}_{max}$.
For an MUAVS that has N units, each quadrotor i in the system satisfies the following formula to indicate the motion state:
$$\dot{\kappa}_{u_i} = \begin{bmatrix} v_i \sin\psi_i & v_i \cos\psi_i & \dot{\psi}_i \end{bmatrix}^T,$$
where $[x_i, y_i]$ is the position of quadrotor $i$ in the ERF. Meanwhile, the target has the same kinematic form:
$$\dot{\kappa}_t = \begin{bmatrix} v_t \sin\psi_t & v_t \cos\psi_t & \dot{\psi}_t \end{bmatrix}^T.$$
Therefore, the relative position based on the Polar Coordinates for tracker–target and tracker–tracker can be given by
$$\kappa_{u t_i} = \begin{bmatrix} d_{t_i} & \varphi_{u t_i} \end{bmatrix}^T,$$
$$\kappa_{u u_i} = \begin{bmatrix} d_{u_{i,1}} \cdots d_{u_{i,i-1}}\ d_{u_{i,i+1}} \cdots d_{u_{i,N}},\ \vartheta_{u_{i,1}} \cdots \vartheta_{u_{i,i-1}}\ \vartheta_{u_{i,i+1}} \cdots \vartheta_{u_{i,N}} \end{bmatrix}^T,$$
where $d_{t_i} = \sqrt{(x_{u_i} - x_t)^2 + (y_{u_i} - y_t)^2}$ represents the distance between tracker $i$ and the target. In particular, $\varphi_{u t_i}$ is the relative yaw angle given by $\varphi_{u t_i} = \psi_i - \xi_{t_i}$, where $\xi_{t_i} = \arctan\frac{x_t - x_i}{y_t - y_i}$. The term $d_{u_{i,j}}$ is the distance between trackers $i$ and $j$, while $\vartheta_{u_{i,j}} = |\xi_{t_i} - \xi_{t_j}|$ is the relative encirclement angle. Based on the aforementioned definitions, the system's relative kinematics can be expressed as follows:
$$\dot{\kappa}_{u t_i} = \begin{bmatrix} \dot{d}_{t_i} \\ \dot{\varphi}_{u t_i} \end{bmatrix} = \begin{bmatrix} v_{u_i}\cos\varphi_{u t_i} - v_t\cos(\psi_t - \xi_{t_i}) \\ \dot{\psi}_i - \dot{\xi}_{t_i} \end{bmatrix},$$
$$\dot{\kappa}_{u u_i} = \begin{bmatrix} \dot{d}_{u_{i,1}} \\ \vdots \\ \dot{d}_{u_{i,i-1}} \\ \dot{d}_{u_{i,i+1}} \\ \vdots \\ \dot{d}_{u_{i,N}} \\ \vartheta_{u_{i,1}} \\ \vdots \\ \vartheta_{u_{i,i-1}} \\ \vartheta_{u_{i,i+1}} \\ \vdots \\ \vartheta_{u_{i,N}} \end{bmatrix} = \begin{bmatrix} v_i\cos\varphi_{u u_{i,1}} - v_1\cos\left(\psi_1 - \xi_{u_{i,1}}\right) \\ \vdots \\ v_i\cos\varphi_{u u_{i,i-1}} - v_{i-1}\cos\left(\psi_{i-1} - \xi_{u_{i,i-1}}\right) \\ v_i\cos\varphi_{u u_{i,i+1}} - v_{i+1}\cos\left(\psi_{i+1} - \xi_{u_{i,i+1}}\right) \\ \vdots \\ v_i\cos\varphi_{u u_{i,N}} - v_N\cos\left(\psi_N - \xi_{u_{i,N}}\right) \\ \arccos\left[\left(d_{t_1}^2 + d_{t_i}^2 - d_{u_{1,i}}^2\right)/\left(2 d_{t_1} d_{t_i}\right)\right] \\ \vdots \\ \arccos\left[\left(d_{t_i}^2 + d_{t_{i-1}}^2 - d_{u_{i,i-1}}^2\right)/\left(2 d_{t_i} d_{t_{i-1}}\right)\right] \\ \arccos\left[\left(d_{t_i}^2 + d_{t_{i+1}}^2 - d_{u_{i,i+1}}^2\right)/\left(2 d_{t_i} d_{t_{i+1}}\right)\right] \\ \vdots \\ \arccos\left[\left(d_{t_i}^2 + d_{t_N}^2 - d_{u_{i,N}}^2\right)/\left(2 d_{t_i} d_{t_N}\right)\right] \end{bmatrix}.$$
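The relative quantities of Equations (8)-(11) reduce to a few lines of NumPy. The sketch below is illustrative and follows the paper's arctan and law-of-cosines definitions; the function names are our own.

```python
# Illustrative computation of the relative kinematic quantities (Eqs. (8)-(11)).
import numpy as np

def tracker_target_state(x_i, y_i, psi_i, x_t, y_t):
    """Distance d_t_i and relative yaw phi_ut_i of Equation (8)."""
    d_ti = np.hypot(x_i - x_t, y_i - y_t)
    xi_ti = np.arctan2(x_t - x_i, y_t - y_i)   # xi_t_i = arctan((x_t - x_i)/(y_t - y_i)) as in the text
    phi_uti = psi_i - xi_ti
    return d_ti, phi_uti

def encirclement_angle(d_ti, d_tj, d_ij):
    """Relative encirclement angle between trackers i and j (law of cosines, Eq. (11))."""
    cos_v = (d_ti**2 + d_tj**2 - d_ij**2) / (2.0 * d_ti * d_tj)
    return np.arccos(np.clip(cos_v, -1.0, 1.0))
```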

3. Stochastic Potential Game

In this section, the SPG formulation is used to establish the game relations between the MUAVS and the target, simultaneously considering objectives such as tracking, encirclement, and obstacle avoidance. In a practical scenario where the MUAVS executes a tracking and encirclement mission, system performance is determined not only by the relative kinematics but also by survivability in an unknown environment. Consequently, the traditional zero-sum Pursuer–Evader game is not suitable for formulating this situation. Motivated by this, we adopt SPG theory, derived from the general-sum PG and the MDP, to describe the game status of the system.
The SPG environment is shown in Figure 2. We represent the continuous SPG as the tuple $\mathcal{G} := \langle \mathcal{N}, \mathcal{S}, \Pi, \mathcal{R}, \mathcal{P} \rangle$, where $\mathcal{N} := \{1, 2, \ldots, N, t\}$ denotes the set of agents containing both the trackers and the target. The remaining elements of the tuple $\mathcal{G}$ are described as follows:
  • $\mathcal{S} := \{s^1, s^2, \ldots, s^N, s^t\}_{i \in \mathcal{N}}$ is the set of agent states constructed from relative kinematic information and obstacle observations.
  • $\Pi := \times_{i \in \mathcal{N}} \pi^i$, $\pi^i \in \mathcal{A}^i$ is the joint Markov strategy set, where $\pi^i$ is the mixed strategy drawn from the continuous action space $\mathcal{A}^i$.
  • $\mathcal{P} := \{P^1, P^2, \ldots, P^N, P^t\}$, $P^i : s^i \times \pi^i \times s^i \to [0, 1]$ is the state transition probability function expressing the system response characteristic.
  • $\mathcal{R} := \{R^1, R^2, \ldots, R^N, R^t\}_{i \in \mathcal{N}}$ is the reward function for each agent in $\mathcal{N}$.
In the traditional research field of stochastic games (SG), an augmented MDP involves agents from $\mathcal{N}$ who seek to maximize their rewards. The SPG is a subset of SG that incorporates PG properties, where a global potential function quantifies agents' behaviors without relying on the characteristics of individual agents. At each step, every agent $i$ takes an action $a^i \sim \pi^i$, leading to a change in the potential function, while the state $\mathcal{S}$ transitions to a new point according to $\mathcal{P}$. The essential task in constructing the SPG is to demonstrate the existence of a potential function with a unified formulation that represents the change in agents' rewards. In the case we focus on, there is a global potential function $\phi : \mathcal{S} \times \Pi \to \mathbb{R}$ that must satisfy Definition 1 for any agent.
Definition 1. 
An SG is an SPG if a global potential function can describe the differential of all agents' rewards (cost matrices). For any $i \in \mathcal{N}$,
$$R^i\left(s, (a^i, a^{-i})\right) - R^i\left(\hat{s}, (\hat{a}^i, a^{-i})\right) = \phi\left(s, (a^i, a^{-i})\right) - \phi\left(\hat{s}, (\hat{a}^i, a^{-i})\right),$$
where $a^{-i} := \{a^1, a^2, \ldots, a^{i-1}, a^{i+1}, \ldots, a^N\}$, $i \in \mathcal{N}$, is the joint action excluding $a^i$; $\hat{a}^i$ is the next-step action of agent $i$, and $\hat{s}$ represents the new state caused by $\hat{a}^i$.
If all agents have the same form of reward function, then Definition 1 is naturally satisfied. However, when the reward functions are not specified or do not share a unified format, finding a potential function in closed form becomes difficult. Therefore, demonstrating the existence of a potential function is an alternative way to prove that an SG satisfies the conditions of an SPG. Based on Definition 1, the following variational equation can be established:
$$\frac{\partial R^i\left(s, (a^i, a^{-i})\right)}{\partial a^i} = \frac{\partial \phi\left(s, (a^i, a^{-i})\right)}{\partial a^i}.$$
Given the condition stated in Equation (13), Lemma 1 provides another way to prove that the described scenario satisfies the definition of an SPG. In Section 4, it serves as a precondition for deploying MARL methods to search for the NE strategy.
Lemma 1. 
For any player i and j in the game environment, the game G is a continuous SPG if and only if their reward functions (cost matrices) satisfy
$$\frac{\partial^2 R^i\left(s, (a^i, a^{-i})\right)}{\partial a^i \partial a^j} = \frac{\partial^2 R^j\left(s, (a^j, a^{-j})\right)}{\partial a^i \partial a^j}.$$
In the case of MUAVS versus the target, the definition of reward functions is critical for building the game environment. To describe R i , the following equation is defined to model the payoff during the training process of the algorithm:
$$R^i\left(s, (a^i, a^{-i})\right) = R_s^i\left(s, a^i\right) + R_c^i\left(s, (a^i, a^{-i})\right),$$
where we have the following:
  • $R_s^i = R_{\mathrm{col}}^i(s^i) + R_{\mathrm{act}}^i(a^i)$ is the self-influence reward, affected by agent $i$'s obstacle avoidance and the smoothness of its actions.
  • $R_c^i = R_{\mathrm{trac}}^i(s^i, a^i, a^{-i}) + R_{\mathrm{sur}}^i(s^i, a^i, a^{-i})$ represents the inter-influence reward, which includes the tracking and encirclement state.
Specifically, depending on whether agent $i$ is a tracker or the target, $R_{\mathrm{trac}}^i$ and $R_{\mathrm{sur}}^i$ have different expressions, as shown in Equations (16) and (17):
$$R_{\mathrm{trac}}^i = \begin{cases} \dfrac{1}{N} f_{trac}\left(d_{t_i},\ a^i - a^j\right), & \text{if } i \text{ is a Tracker} \\ \dfrac{1}{N} \sum_{j=0}^{N} f_{trac}\left(d_{t_j},\ a^j - a^i\right), & \text{if } i \text{ is the Target}, \end{cases}$$
$$R_{\mathrm{sur}}^i = \begin{cases} f_{sur}\left(\vartheta_{u_{i,-i}},\ a^i,\ a^{-i}\right), & \text{if } i \text{ is a Tracker (the Target is not in } -i\text{)} \\ 0, & \text{if } i \text{ is the Target}. \end{cases}$$
Note that both $f_{trac}(\cdot)$ and $f_{sur}(\cdot)$ are continuously differentiable for all $a^i \in \pi^i$. Additionally, $f_{trac}(\cdot)$ is linear with respect to $a^i - a^j$. Detailed descriptions of these functions are provided in Section 4.2. The term $d_{t_i}$ represents the distance between tracker $i$ (or $j$) and the target, which is one element of the state vector $s$.
Based on the fact that all agents in $\mathcal{N}$ have the same form of reward function, shown in Equation (15), it is feasible to demonstrate the existence of the potential function using Lemma 1. There are two scenarios for Equation (16): either both agents $i$ and $j$ are members of the MUAVS, or one of them is the target. Given that $R_s^i$ is not affected by $a^j$, its partial derivative with respect to $a^j$ is zero, as shown in Equation (18). Consequently, the partial derivative of $R^i$ can be transformed as in Equation (19):
$$\frac{\partial^2 R_s^i\left(s, a^i\right)}{\partial a^i \partial a^j} = 0,$$
$$\frac{\partial^2 R^i\left(s, (a^i, a^{-i})\right)}{\partial a^i \partial a^j} = \begin{cases} \dfrac{1}{N}\dfrac{\partial^2 f_{trac}\left(d_{t_i},\ a^i - a^j\right)}{\partial a^i \partial a^j}, & \text{if } j \text{ is the Target} \\ \dfrac{\partial^2 f_{sur}\left(\vartheta_{u_{i,j}},\ a^i,\ a^j\right)}{\partial a^i \partial a^j}, & \text{if } i \text{ and } j \text{ are Trackers}. \end{cases}$$
When agent j represents the target, according to Equation (20), the partial derivative of R j can be expressed as shown in Equation (21):
$$\frac{\partial^2 f_{trac}\left(d_{t_k},\ a^k - a^j\right)}{\partial a^i \partial a^j} = 0, \quad k \in \mathcal{N}\ \text{and}\ k \ne i,$$
$$\frac{\partial^2 R^j\left(s, (a^j, a^{-j})\right)}{\partial a^i \partial a^j} = \begin{cases} \dfrac{1}{N}\dfrac{\partial^2 f_{trac}\left(d_{t_i},\ a^i - a^j\right)}{\partial a^i \partial a^j}, & \text{if } j \text{ is the Target} \\ \dfrac{\partial^2 f_{sur}\left(\vartheta_{u_{j,i}},\ a^j,\ a^i\right)}{\partial a^i \partial a^j}, & \text{if } i \text{ and } j \text{ are Trackers}. \end{cases}$$
There are several points that should be noted:
  • Due to the relativity of the Polar Coordinates used in Section 2, $\vartheta_{u_{i,j}} = \vartheta_{u_{j,i}}$ when both agents $i$ and $j$ are members of the MUAVS.
  • According to the linearity of $f_{trac}(\cdot)$, Equation (22) holds when either $i$ or $j$ represents the target in the pair $(i, j)$:
$$\frac{\partial^2 f_{trac}\left(d_{t_i},\ a^i - a^j\right)}{\partial a^i \partial a^j} = 0.$$
Based on the preceding discussion, both of the aforementioned scenarios satisfy the conditions stipulated in Lemma 1. Therefore, the existence of a potential function for the proposed cost metrics is proven.
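Because the existence proof rests on the cross-partial symmetry of Lemma 1, the condition can also be checked numerically. The sketch below does this with central finite differences for a pair of hypothetical rewards that share an interaction term depending on the action difference, in the spirit of $f_{trac}$; the specific reward forms are invented for illustration and are not the paper's.

```python
# Numerical check of the symmetry condition of Equation (14) via finite differences.
import numpy as np

def cross_partial(R, a_i, a_j, i_dim=0, j_dim=0, h=1e-4):
    """Approximate d^2 R / (da_i da_j) by nested central differences."""
    def dR_daj(ai):
        ajp, ajm = a_j.copy(), a_j.copy()
        ajp[j_dim] += h; ajm[j_dim] -= h
        return (R(ai, ajp) - R(ai, ajm)) / (2 * h)
    aip, aim = a_i.copy(), a_i.copy()
    aip[i_dim] += h; aim[i_dim] -= h
    return (dR_daj(aip) - dR_daj(aim)) / (2 * h)

# Hypothetical reward pair sharing the same interaction term in (a_i - a_j).
R_i = lambda ai, aj: -np.sum((ai - aj)**2) - np.sum(ai**2)          # tracker-style placeholder
R_j = lambda aj, ai: -np.sum((aj - ai)**2) - 0.5 * np.sum(aj**2)    # target-style placeholder

ai, aj = np.array([1.0, 0.3]), np.array([-0.5, 0.8])
lhs = cross_partial(lambda x, y: R_i(x, y), ai, aj)      # d^2 R_i / da_i da_j
rhs = cross_partial(lambda x, y: R_j(y, x), ai, aj)      # d^2 R_j / da_i da_j
print(np.isclose(lhs, rhs))  # True -> this pair satisfies the Lemma 1 condition
```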
Since [43] proved that efficiently searching for the NE in an SG is PPAD-hard, some researchers have focused on particular scenarios to reduce the difficulty of finding the NE. In [38], it has been proven that solving an MDP can lead to finding a local NE of the SPG. The Markov Perfect Equilibrium (MPE), presented in Definition 2, is closely related to the local NE in an SPG. As an NE variant, it is an appropriate concept for linking the MDP with game theory. Therefore, it is feasible to use stochastic dynamic optimization techniques such as MARL to search for the MPE strategy in an SPG.
Definition 2. 
An action strategy $\bar{\pi} := \left\{\bar{\pi}^1, \bar{\pi}^2, \ldots, \bar{\pi}^N\right\} \in \Pi$ is an MPE of game $\mathcal{G}$ if:
$$V_{\bar{\pi}}^i(s) \ge V_{\pi}^i(s), \quad \forall \pi \in \Pi,\ \forall s \in \mathcal{S},$$
where $V^i$ is the value function of the MDP.

4. Time-Series Multi-Agent Soft Actor–Critic Approach

In this section, the TMSAC approach is introduced in detail to solve the SPG formulation constructed in Section 3. Since the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [24] became the common baseline for deep MARL algorithms, a large number of practical methods, such as Counterfactual Multi-Agent (COMA) [44] and Factored Multi-Agent Centralized (FACMAC) policy gradients [45], have consistently refreshed benchmark rankings. However, designing MARL algorithms that suit continuous action-state spaces and are robust in real-time control tasks remains challenging, and progress has been slow. To enhance algorithm efficacy, it is reasonable to build a framework that leverages the sequential nature of agents' state information in continuous MARL methods, given the time continuity of the tasks. Therefore, combined with the centralized training and distributed execution strategy introduced by MADDPG, we propose a new AC construction that adopts the Gated Recurrent Unit (GRU) structure to handle continuous sequential state information.

4.1. State and Action Embedding

4.1.1. State Vector

To realize end-to-end manipulation, the state vector $s^i$ of agent $i$ is defined to incorporate observations from airborne sensors, including sequential relative kinematic information and obstacle observations. The state vector can be represented as follows:
$$s^i = \begin{cases} \left[\left\{\kappa_{u t_1}, \ldots, \kappa_{u t_j}, \ldots, \kappa_{u t_N}\right\}_{\times L},\ O_{obs}^i\right], & \text{if } i \text{ is the Target} \\ \left[\left\{\kappa_{u t_i},\ \kappa_{u u_i}\right\}_{\times L},\ O_{obs}^i\right], & \text{if } i \text{ is a Tracker}. \end{cases}$$
For the target, $\left\{\kappa_{u t_1}, \ldots, \kappa_{u t_N}\right\}$ represents the tracker–target relative kinematics defined in Equation (10). Meanwhile, $\left\{\kappa_{u t_i}, \kappa_{u u_i}\right\}$ represents both the relative kinematics between tracker $i$ and the target and the relative kinematics among the trackers within the group. As shown in Figure 3, the obstacle observation $O_{obs}^i$ is obtained from a laser radar ranging system, simplified to an $N_{obs}$-dimensional distance vector sampled isometrically around the detection circle centered on the agent. Obtaining the state vectors during the tracking process is facilitated by the advanced capabilities of modern airborne sensors. The variable $L$ denotes the observation duration, i.e., the number of historical steps the agents retain and utilize.
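A possible way to assemble the state vector of Equation (24) for one tracker is sketched below; the window length, beam count, and class names are assumptions, not values from the paper.

```python
# Illustrative state assembly for one tracker (Eq. (24)); L and N_OBS are placeholders.
import numpy as np
from collections import deque

L, N_OBS = 8, 36   # assumed observation duration and number of radar beams

class StateBuffer:
    def __init__(self, kin_dim):
        self.frames = deque(maxlen=L)   # rolling window of relative-kinematics vectors
        self.kin_dim = kin_dim

    def push(self, kappa_ut_i, kappa_uu_i):
        self.frames.append(np.concatenate([kappa_ut_i, kappa_uu_i]))

    def state(self, obs_ranges):
        # Zero-pad until L frames have been collected, then append the radar scan O_obs^i.
        pad = [np.zeros(self.kin_dim)] * (L - len(self.frames))
        seq = np.stack(pad + list(self.frames))       # shape (L, kin_dim)
        return seq, np.asarray(obs_ranges)            # (kinematic sequence, O_obs^i)
```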

4.1.2. Action Vector

Considering that the action outputs are inputs to the UAV speed and attitude controller, the ground velocity $v_i$ and yaw rate $\dot{\psi}_i$ must be manipulated simultaneously so that the actor performs as a practical UAV navigation system. Given the kinematics, the action of agent $i$, $a^i$, is designed in two dimensions:
$$a^i = f_{\pi^i}\left(s^i\right) = \begin{bmatrix} v_i & \dot{\psi}_i \end{bmatrix}^T,$$
where $v_i$ and $\dot{\psi}_i$ are limited by Assumption 5.

4.2. Composite Rewards Function

In the SPG problem, a key challenge is designing instructive reward functions to be optimized by the MARL algorithm. As shown in Equation (15), the total reward of each agent is separated into two parts, self-influence rewards and inter-influence rewards, whose essence is to measure individual performance and the contribution of the trackers simultaneously.

4.2.1. Self-Influence Rewards

Self-influence rewards serve as crucial indicators of an agent’s survival capability and comprise two key components. The first component focuses on control system stability—to ensure reliable and stable operation of the control system, the action sequences output by the algorithm must maintain smoothness and continuity while avoiding dramatic fluctuations. The second component concentrates on evaluating collision avoidance capabilities. In real-world scenarios, agents may encounter various types of obstacles, such as static buildings or dynamic obstacles. Based on these considerations, we decompose self-influence rewards into two specific reward functions: action smoothness reward R a c t and collision avoidance reward R col . These two rewards are quantitatively calculated through the following mathematical expressions to guide agents in learning safer and more stable behavioral strategies:
$$R_{\mathrm{act}}^i\left(a^i\right) = \beta_{\mathrm{act}} \sum_{t=0}^{N_{\mathrm{act}}} \frac{d a^i}{d t},$$
$$R_{\mathrm{col}}^i\left(s^i\right) = R_{\mathrm{obs}}^i\left(O_{\mathrm{obs}}^i\right) + R_{\mathrm{int}}^i\left(\kappa_{u t_N}, \kappa_{u t_i}\right),$$
where $\beta_{\mathrm{act}}$ is the gain factor for the action change rate, and $N_{\mathrm{act}}$ is the size of the summation window. As indicated in Equation (27), the collision avoidance reward consists of an obstacle-influence part $R_{\mathrm{obs}}^i$ and an agent-influence part $R_{\mathrm{int}}^i$, described as follows:
$$R_{\mathrm{obs}}^i = \begin{cases} \beta_{\mathrm{obs}} \sum_{n=0}^{N_{\mathrm{obs}}} \left(l_n - l_{\mathrm{obs}}\right), & \text{if } l_{\mathrm{obs}} \ge l_n > 0 \\ -\beta_{\mathrm{obs}} N_{\mathrm{obs}} l_{\mathrm{obs}}, & \text{if } l_n = 0, \end{cases}$$
$$R_{\mathrm{int}}^i = \begin{cases} \beta_{\mathrm{int}} \sum_{j=0}^{N_{\mathrm{int}}} \left(d_j - 2 l_{\mathrm{int}}\right), & \text{if } 2 l_{\mathrm{int}} \ge d_j > 0 \\ -2 \beta_{\mathrm{int}} N_{\mathrm{int}} l_{\mathrm{int}}, & \text{if } d_j = 0 \\ 0, & \text{if } d_j > 2 l_{\mathrm{int}}, \end{cases}$$
where $\beta_{\mathrm{obs}}$ and $\beta_{\mathrm{int}}$ are the respective gain factors. For $R_{\mathrm{obs}}^i$, $l_n$ denotes the detected distance of each of the $N_{\mathrm{obs}}$ beams of agent $i$. When any $l_n = 0$, a collision has occurred, and in this case $R_{\mathrm{obs}}^i$ is set to a negative value to represent a penalty. Meanwhile, in $R_{\mathrm{int}}^i$, $d_j$ abbreviates $d_{u_{i,j}}$ and $d_{t_i}$, and $N_{\mathrm{int}}$ is the collection of agents for which $d_j \le 2 l_{\mathrm{int}}$.
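A compact sketch of the self-influence rewards of Equations (26)-(29) is given below. The gain factors and thresholds are placeholders, and the signs follow the penalty interpretation described above rather than values confirmed by the paper.

```python
# Illustrative self-influence rewards; beta gains, l_obs and l_int are assumed values.
import numpy as np

def r_act(action_window, beta_act=0.1):
    """Action-smoothness term: penalize the summed change rate over the window."""
    return -beta_act * np.sum(np.abs(np.diff(action_window, axis=0)))

def r_obs(ranges, l_obs=5.0, beta_obs=0.05):
    """Obstacle part of the collision-avoidance reward."""
    if np.any(ranges == 0.0):                  # a beam length of zero means a collision
        return -beta_obs * len(ranges) * l_obs
    close = ranges[(ranges > 0.0) & (ranges <= l_obs)]
    return beta_obs * np.sum(close - l_obs)    # non-positive, goes to 0 when no beam is close

def r_int(dists, l_int=20.0, beta_int=0.05):
    """Inter-agent part: penalize neighbors closer than twice the safe distance."""
    if np.any(dists == 0.0):
        return -2.0 * beta_int * len(dists) * l_int
    near = dists[dists <= 2.0 * l_int]
    return beta_int * np.sum(near - 2.0 * l_int)
```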

4.2.2. Inter-Influence Rewards

In this study, the inter-influence rewards consist of two components, the tracking utility $R_{\mathrm{trac}}^i$ and the encirclement utility $R_{\mathrm{sur}}^i$, which jointly measure mission performance. In detail, $\dot{d}_{t_i}$ plays the main role in $R_{\mathrm{trac}}^i$ because it expresses the kinematic rate of change, and $\vartheta_{u_{i,j}}$ is counted in $R_{\mathrm{sur}}^i$ to evaluate the encirclement status. Therefore, the aforementioned reward functions are denoted as
$$R_{\mathrm{trac}}^i = \begin{cases} \dfrac{\beta_{\mathrm{trac}}}{N\left(v_{\max}^t + v_{\max} + \dot{d}_{t_i}\right)}, & \text{if } i \text{ is a Tracker} \\ \dfrac{\beta_{\mathrm{trac}}}{N} \displaystyle\sum_{j=0}^{N} \dfrac{1}{v_{\max}^t + v_{\max} + \dot{d}_{t_j}}, & \text{if } i \text{ is the Target}, \end{cases}$$
$$R_{\mathrm{sur}}^i = \begin{cases} -\beta_{\mathrm{sur}}\left[\left(\vartheta_{u_{i,i+1}} - \dfrac{2\pi}{N_{\mathrm{uav}}}\right)^2 + \left(\vartheta_{u_{i-1,i}} - \dfrac{2\pi}{N_{\mathrm{uav}}}\right)^2\right], & \text{if } i \text{ is a Tracker} \\ 0, & \text{if } i \text{ is the Target}. \end{cases}$$
As with the gain factors defined above, $\beta_{\mathrm{trac}}$ and $\beta_{\mathrm{sur}}$ are gain factors. Note that the inverse-proportional form of $R_{\mathrm{trac}}^i$ is designed to guide the agents' relative distance rate close to their maximum speed, i.e., to encourage selecting the fastest tracking strategy. Furthermore, it is crucial for each UAV agent to maintain an encirclement angle of $\vartheta_{u_{i-1,i}} = \vartheta_{u_{i,i+1}} = \frac{2\pi}{N_{\mathrm{uav}}}$.
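The inter-influence rewards can be sketched in the same style. The functional forms below follow our reading of Equations (30) and (31), i.e., an inverse-proportional term in the closing rate and a squared deviation from $2\pi/N_{\mathrm{uav}}$; the gains and exact expressions are assumptions.

```python
# Illustrative inter-influence rewards for a tracker; gains and forms are assumed.
import numpy as np

def r_trac_tracker(d_dot_ti, v_max_t, v_max, n_uav, beta_trac=1.0, eps=1e-3):
    # d_dot_ti approaches -(v_max_t + v_max) when closing at full relative speed,
    # so the denominator shrinks and the inverse-proportional reward grows.
    return beta_trac / (n_uav * max(v_max_t + v_max + d_dot_ti, eps))

def r_sur_tracker(theta_prev, theta_next, n_uav, beta_sur=1.0):
    # Penalize deviation of both neighboring encirclement angles from 2*pi/N_uav.
    target = 2.0 * np.pi / n_uav
    return -beta_sur * ((theta_next - target) ** 2 + (theta_prev - target) ** 2)
```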

4.3. Algorithm

In this subsection, we discuss the algorithm framework together with the structures of the actor and critic. Inspired by the Soft Actor–Critic (SAC) [46], which has revealed excellent performance in continuous tasks, the soft Q-value and strategy update methods are adopted in our research. Furthermore, to leverage the time-serial data processing capability of the GRU, we propose an AC structure that combines Multilayer Perceptrons (MLPs) and a GRU to extract valuable information from the state vector.

4.3.1. Learning Methods

Based on the recursive Bellman operator, the soft Q-value for each agent i is defined as
$$Q_s^i\left(s_t^i, a_t^i\right) = R^i\left(s_t^i, a_t^i\right) + \gamma V^i\left(s_{t+1}^i\right),$$
$$V^i\left(s_{t+1}^i\right) = \mathbb{E}_{\left(s_{t+1}^i, a_{t+1}^i\right)}\left[Q_s^i\left(s_{t+1}^i, a_{t+1}^i\right) - \alpha^i \ln \pi^i\left(a_{t+1}^i \mid s_{t+1}^i\right)\right],$$
where $\alpha^i$ serves as the temperature coefficient controlling the weight of the information entropy, balancing the exploration and exploitation behaviors of the algorithm, and $\gamma$ is the discount factor that determines the horizon of the Q-value expectation. Following soft policy iteration, the optimal strategy is expressed as
$$\pi^{i*} = \arg\max_{\pi^i} \sum_{t \in T} \mathbb{E}\left[R^i\left(s_t, a_t\right) + \alpha^i \mathcal{H}\left(\pi^i\right)\right],$$
where $\mathcal{H}\left(\pi^i\right) = -\int \pi^i\left(a_t^i \mid s_t^i\right) \ln \pi^i\left(a_t^i \mid s_t^i\right) d a_t^i$ is the information entropy. The above equation indicates that the optimal strategy should not only maximize rewards but also account for entropy to promote exploration during the initial stages of training. There are two main streams of action sampling in continuous MARL: the DDPG-like deterministic action output and the Gaussian-distribution sampling implemented in SAC. Compared with a hard output, probabilistic sampling introduces more stochasticity and avoids discarding effective actions.
As is typical of most MARL methods, TMSAC employs neural networks to parameterize the actor and the critic, allowing gradient backpropagation during training. The AC parameters are denoted as $\theta_a^i$ and $\theta_c^i$, with $\bar{\theta}_c^i$ for the target critic used in the update. Meanwhile, a Double-Critic technique is implemented, in which each agent has two independent critic networks to calculate the soft Q-value; this helps stabilize the training process. According to the definition of the Bellman residual, the loss function used in the stochastic gradient descent of the critic has the following form:
$$J_{Q_s^i}\left(\theta_c^i\right) = \frac{1}{2}\,\mathbb{E}_{\left(s_t^i, a_t^i\right)}\left[\left(Q_s^i\left(s_t^i, a_t^i\right) - \left(R^i\left(s_t^i, a_t^i\right) + \gamma\, \mathbb{E}_{\left(s_{t+1}^i, a_{t+1}^i\right)}\left[\bar{Q}_s^{i,\min}\left(s_{t+1}^i, a_{t+1}^i\right)\right]\right)\right)^2\right],$$
where $\bar{Q}_s^{i,\min} = \min_{j=1,2} \bar{Q}_s^{i,j}\left(s_{t+1}^i, a_{t+1}^i\right)$ is the minimum target soft Q-value from the Double-Critic. Thus, the gradient of the critic update can be written as follows:
$$\nabla_{\theta_c^i} J_{Q_s^i}\left(\theta_c^i\right) = \mathbb{E}\left[\nabla_{\theta_c^i} Q_s^i\left(s_t^i, a_t^i\right)\left(Q_s^i\left(s_t^i, a_t^i\right) - R^i\left(s_t^i, a_t^i\right) - \gamma\left(\bar{Q}_s^{i,\min}\left(s_{t+1}^i, a_{t+1}^i\right) - \alpha^i \ln \pi^i\left(a_{t+1}^i \mid s_{t+1}^i\right)\right)\right)\right],$$
where ∇ means the differential operator. In addition, a parameterized Gaussian distribution sampling function G θ a i s t i is utilized to model the action outputs. The actor loss function and its derivative are indicated as the following form to realize soft policy iteration:
$$J_{\pi^i}\left(\theta_a^i\right) = \mathbb{E}_{\left(s_t^i, a_t^i\right)}\left[\alpha^i \ln \pi_{\theta_a^i}^i\left(G_{\theta_a^i}\left(s_t^i\right) \mid s_t^i\right) - Q_s^{i,\min}\left(s_t^i, G_{\theta_a^i}\left(s_t^i\right)\right)\right],$$
$$\nabla_{\theta_a^i} J_{\pi^i}\left(\theta_a^i\right) = \mathbb{E}\left[\nabla_{\theta_a^i}\, \alpha_t \ln \pi_{\theta_a^i}^i\left(a_t^i \mid s_t^i\right) + \nabla_{\theta_a^i} G_{\theta_a^i}\left(s_t^i\right)\left(\nabla_{a^i}\, \alpha_t \ln \pi_{\theta_a^i}^i\left(a_t^i \mid s_t^i\right) - \nabla_{a^i} Q_s^{i,\min}\left(s_t^i, a_t^i\right)\right)\right].$$
To reduce the number of hyper-parameters of the algorithm, $\alpha^i$ is updated adaptively based on the following objective, increasing exploration at the primary stage of training and restraining it as the entropy $\mathcal{H}$ approaches the target value $\hat{\mathcal{H}}$:
$$J\left(\alpha^i\right) = -\alpha^i\, \mathbb{E}_{\left(s_t^i, a_t^i\right)}\left[\ln \pi_{\theta_a^i}^i\left(a_t^i \mid s_t^i\right) + \hat{\mathcal{H}}\right].$$
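Putting Equations (35), (37), and (39) together, a per-agent update can be sketched in PyTorch as follows. The `actor`, `critic1/2`, and `target1/2` networks, the `sample` interface, and the batch layout are assumptions; the snippet mirrors standard soft actor-critic bookkeeping rather than the authors' released code.

```python
# Sketch of one TMSAC-style update for agent i (assumed interfaces, not the paper's code).
import torch
import torch.nn.functional as F

def update_agent(batch, actor, critic1, critic2, target1, target2,
                 log_alpha, gamma=0.99, target_entropy=-2.0):
    s, a, r, s_next = batch                        # tensors for agent i
    alpha = log_alpha.exp()

    # Critic loss (Eq. (35)) with the Double-Critic minimum inside the soft target.
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)   # reparameterized Gaussian sample
        q_next = torch.min(target1(s_next, a_next), target2(s_next, a_next))
        y = r + gamma * (q_next - alpha * logp_next)
    critic_loss = 0.5 * (F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y))

    # Actor loss (Eq. (37)): trade off min-Q against the policy entropy term.
    a_new, logp_new = actor.sample(s)
    q_new = torch.min(critic1(s, a_new), critic2(s, a_new))
    actor_loss = (alpha.detach() * logp_new - q_new).mean()

    # Temperature loss (Eq. (39)): adapt alpha toward the target entropy.
    alpha_loss = -(log_alpha * (logp_new.detach() + target_entropy)).mean()
    return critic_loss, actor_loss, alpha_loss
```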

4.3.2. Actor and Critic Structures

Unlike conventional RL approaches for TTE that use simple parameterized approximators, our proposed method integrates recurrent units into the actor structure, as illustrated in Figure 4a. Its inputs include the sequential relative kinematic vector $L \times \kappa^i$ and the obstacle observation $O_{obs}^i$. Regarding the network architecture, the H1 layer of the actor, as shown in Figure 4a, is implemented as a GRU comprising 256 hidden units with a tanh activation function for the new gates; this design is similar to Long Short-Term Memory (LSTM) but has fewer parameters and demonstrates comparable performance. The MLP hidden layers H2 and H3 extract features from the upper-layer outputs. Finally, the outputs O1 and O2 represent the means and log standard deviations of the Gaussian distribution, respectively, in preparation for action sampling. On the other hand, the critic, i.e., the soft Q-value approximator, estimates the Q-value based on the current state. Therefore, a fully MLP-based construction, shown in Figure 4b, is used to achieve this goal. Its input consists of the last kinematic step $\kappa^i$ and the observation $O_{obs}^i$.
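Under assumed layer sizes, the actor of Figure 4a can be sketched in PyTorch as a GRU (H1) over the L-step kinematic sequence, two MLP layers (H2, H3), and Gaussian heads (O1, O2). The tanh squashing and the hidden dimension of 256 follow the description above; everything else is illustrative.

```python
# Illustrative recurrent actor in the spirit of Figure 4a (layer widths assumed).
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    def __init__(self, kin_dim, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.gru = nn.GRU(kin_dim, hidden, batch_first=True)   # H1: GRU with tanh new gates
        self.mlp = nn.Sequential(
            nn.Linear(hidden + obs_dim, hidden), nn.ReLU(),     # H2
            nn.Linear(hidden, hidden), nn.ReLU(),               # H3
        )
        self.mu_head = nn.Linear(hidden, act_dim)               # O1: means
        self.log_std_head = nn.Linear(hidden, act_dim)          # O2: log standard deviations

    def forward(self, kin_seq, obs):
        # kin_seq: (batch, L, kin_dim); obs: (batch, N_obs)
        _, h = self.gru(kin_seq)                                # final hidden state of the sequence
        feat = self.mlp(torch.cat([h.squeeze(0), obs], dim=-1))
        return self.mu_head(feat), self.log_std_head(feat).clamp(-20, 2)

    def sample(self, kin_seq, obs):
        mu, log_std = self(kin_seq, obs)
        dist = torch.distributions.Normal(mu, log_std.exp())
        raw = dist.rsample()
        action = torch.tanh(raw)                                # squash to the bounded action range
        logp = (dist.log_prob(raw) - torch.log(1 - action.pow(2) + 1e-6)).sum(-1, keepdim=True)
        return action, logp
```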

4.3.3. Details

The training process of TMSAC is shown in Algorithm 1. In the preparatory phase, all agents are spawned in the simulation environment, including the target with a stochastic initial speed and position. Agent $i$ is selected for training in each episode according to the following switching rule:
$$f_{\mathrm{sw}}(ep) = \operatorname{div}\left(\operatorname{mod}\left(ep, M_r\right),\ M_r / N\right),$$
where $ep$ is the current episode and $M_r$ is the switching period. Following this rule, the AC parameters of each agent are alternately optimized toward the local Nash Equilibrium (NE) while the parameters of the other agents are frozen. At the beginning of an episode, the state set $\mathcal{S}$ is stochastically initialized under the condition that no collision occurs when the agents are first placed.
After initialization, the agents execute actions $A_t$ sampled from their respective actors and interact with the environment to obtain the new state $S_{t+1}$ and rewards $R_t$. Meanwhile, an experience buffer $\mathcal{D}$ stores the step information set $\left\{S_t, A_t, R_t, S_{t+1}\right\}$ after each interaction. Every $N_g$ steps, the agent $i$ chosen by Equation (40) updates its AC parameters based on the loss functions in Section 4.3.1. Note that the expectation $\mathbb{E}$ in the loss functions is approximated by averaging over a batch of data. The data flowchart of TMSAC is presented in Figure 5.
Algorithm 1 Time-Series Multi-Agent Soft Actor–Critic.
   Input: Randomly initialize $\theta_c^{1,1} \ldots \theta_c^{N,1}$, $\theta_c^{1,2} \ldots \theta_c^{N,2}$, $\theta_a^1 \ldots \theta_a^N$, and set $\bar{\theta}_c^i \leftarrow \theta_c^i$;
   for $ep = 1$ to $M$ do:
      Initialize the environment and agent states $\mathcal{S}$;
      Initialize the experience buffer $\mathcal{D}$;
      Select agent $i = f_{\mathrm{sw}}(ep)$ for training;
      for $step = 1$ to $T$ do:
          Sample actions $A_t = \{a_t^1, \ldots, a_t^N\}$ from each $G_{\theta_a^i}\left(s_t^i\right)$;
          Execute $A_t$ and update the state set $S_{t+1}$ from the environment;
          Obtain rewards $R_t = \{R^1, \ldots, R^N\}$;
          Push $\left\{S_t, A_t, R_t, S_{t+1}\right\}$ into the experience buffer $\mathcal{D}$;
          if $step \bmod N_g = 0$:
              Sample a batch from the experience buffer $\mathcal{D}$;
              Update $\theta_c^{i,j}$, $\theta_a^i$, and $\alpha^i$ of agent $i$ through Equations (35), (37), and (39);
              Softly update $\bar{\theta}_c^i = \tau \theta_c^i + (1 - \tau)\bar{\theta}_c^i$ of agent $i$;
          end if
      end for
   end for
   Output: $\theta_a^1 \ldots \theta_a^N$;
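The outer structure of Algorithm 1, including the switching rule of Equation (40), can be condensed into the following Python sketch. The `env`, `agents`, and `ReplayBuffer` objects and their methods are assumed helpers, not part of the released implementation.

```python
# Condensed sketch of Algorithm 1; env/agents/ReplayBuffer are assumed interfaces.
def f_sw(ep, M_r, N):
    """Equation (40): pick which agent's actor-critic is unfrozen this episode (M_r divisible by N)."""
    return (ep % M_r) // (M_r // N)

def train(env, agents, M=5000, T=400, N_g=10, M_r=50):
    for ep in range(M):
        states = env.reset()
        buffer = ReplayBuffer()
        i = f_sw(ep, M_r, len(agents))           # only agent i is updated in this episode
        for step in range(1, T + 1):
            actions = [ag.act(s) for ag, s in zip(agents, states)]
            next_states, rewards = env.step(actions)
            buffer.push(states, actions, rewards, next_states)
            if step % N_g == 0:
                batch = buffer.sample_batch()
                agents[i].update(batch)           # Eqs. (35), (37), (39)
                agents[i].soft_update_targets()   # theta_bar <- tau*theta + (1 - tau)*theta_bar
            states = next_states
    return [ag.actor for ag in agents]
```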

4.3.4. Convergence Discussion

Although Definition 2 shows that solving the MDP with MARL methods is equivalent to finding a local NE of the proposed SPG scenario, it is still necessary to prove the convergence of the TMSAC algorithm, which is a stochastic approximation tool. Given the policy iteration in Equation (34), the policy updating scheme can be rewritten as Equation (41):
$$\pi_{\mathrm{new}}^i = \arg\min_{\pi^i \in \Pi^i} D_{\mathrm{KL}}\left(\pi^i\left(G_{\theta_a^i}\left(s_t^i\right) \mid s_t^i\right)\ \Big\|\ \exp\left(\frac{1}{\alpha}\left(Q_s^{\pi_{\mathrm{old}}^i}\left(s_t^i, a_t^i\right) - V^{\pi_{\mathrm{old}}^i}\left(s_t^i\right)\right)\right)\right),$$
where $D_{\mathrm{KL}}$ is the Kullback–Leibler divergence, and $Q_s^{\pi_{\mathrm{old}}^i}\left(s_t^i, a_t^i\right)$ and $V^{\pi_{\mathrm{old}}^i}\left(s_t^i\right)$ are the soft state-action value and state value of agent $i$'s policy $\pi_{\mathrm{old}}^i$. Let $\pi_{\mathrm{old}}^i \in \Pi^i$, and let $\pi_{\mathrm{new}}^i$ be the optimizer of Equation (41). Only if $Q_s^{\pi_{\mathrm{new}}^i}\left(s_t^i, a_t^i\right) \ge Q_s^{\pi_{\mathrm{old}}^i}\left(s_t^i, a_t^i\right)$ for all agents can the proposed algorithm guarantee convergence toward the MPE.
Considering that $J_{\pi_{\mathrm{old}}^i}\left(\pi_{\mathrm{old}}^i\right) \ge J_{\pi_{\mathrm{old}}^i}\left(\pi_{\mathrm{new}}^i\right)$, the following inequality can be established:
$$\mathbb{E}_{\left(s_t^i, a_t^i\right) \sim \pi_{\mathrm{new}}^i}\left[\alpha^i \ln \pi_{\mathrm{new}}^i\left(G_{\theta_a^i}\left(s_t^i\right) \mid s_t^i\right) - Q_s^{\pi_{\mathrm{old}}^i}\left(s_t^i, a_t^i\right) + V^{\pi_{\mathrm{old}}^i}\left(s_t^i\right)\right] \le \mathbb{E}_{\left(s_t^i, a_t^i\right) \sim \pi_{\mathrm{old}}^i}\left[\alpha^i \ln \pi_{\mathrm{old}}^i\left(G_{\theta_a^i}\left(s_t^i\right) \mid s_t^i\right) - Q_s^{\pi_{\mathrm{old}}^i}\left(s_t^i, a_t^i\right) + V^{\pi_{\mathrm{old}}^i}\left(s_t^i\right)\right].$$
As $V^{\pi_{\mathrm{old}}^i}\left(s_t^i\right)$ depends only on $s_t^i$, subtracting it from both sides of Inequality (42) reduces the expression to
$$\mathbb{E}_{\left(s_t^i, a_t^i\right) \sim \pi_{\mathrm{new}}^i}\left[Q_s^{\pi_{\mathrm{old}}^i}\left(s_t^i, a_t^i\right) - \alpha^i \ln \pi_{\mathrm{new}}^i\left(G_{\theta_a^i}\left(s_t^i\right) \mid s_t^i\right)\right] \ge \mathbb{E}_{s_t^i}\left[V^{\pi_{\mathrm{old}}^i}\left(s_t^i\right)\right].$$
Given the soft Bellman operator mentioned in Section 4.3.1, the deductive inequality is
$$Q_s^{\pi_{\mathrm{old}}^i}\left(s_t^i, a_t^i\right) = R^i\left(s_t^i, a_t^i\right) + \gamma V^{\pi_{\mathrm{old}}^i}\left(s_{t+1}^i\right) \le R^i\left(s_t^i, a_t^i\right) + \gamma\, \mathbb{E}_{s_{t+1}^i}\left[Q_s^{\pi_{\mathrm{old}}^i}\left(s_{t+1}^i, a_{t+1}^i\right) - \alpha^i \ln \pi_{\mathrm{new}}^i\left(G_{\theta_a^i}\left(s_{t+1}^i\right) \mid s_{t+1}^i\right)\right] \le \cdots \le Q_s^{\pi_{\mathrm{new}}^i}\left(s_t^i, a_t^i\right),$$
which proves the convergence toward the optimal policy of each agent.

5. Simulation and Discussion

In this section, we conduct experiments with the proposed algorithm under various environmental and conditional settings. More importantly, we provide comparative analyses between TMSAC and other continuous MARL methods applied to the SPG. The implementation is developed in Python with the PyTorch deep learning framework for constructing the learning models, while the Matplotlib library is used for visualizing the results. The main computing hardware of the training platform is an NVIDIA RTX 3090 GPU and an AMD Ryzen 5950X CPU. Additionally, a visual simulation that takes into account the dynamic characteristics of the agents is provided, demonstrating the real-time deployability of the algorithm. Since the actor employs a relatively simple neural network architecture, it has low computational requirements during the inference phase after training, which makes it suitable for deployment on embedded platforms.

5.1. Algorithm Performance

Considering small-size quadrotors as the trackers and target in all described simulation scenarios, the environment and agent capability parameters are given in Table 1, and the algorithm hyper-parameters are shown in Table 2. To verify the effectiveness of the algorithm under different environmental conditions, we first conduct simulations in an obstacle-free environment. The MUAVS consists of four to five UAVs. Based on the performance parameters of small quadrotors, the maximum flight speed of each agent is set to 10 m/s, and the maximum angular velocity is set to 15 deg/s. We deploy the MUAVS in a 1000 m × 1000 m two-dimensional simulation area without any obstacles disturbing the tracking process to verify the TTE performance preliminarily. It is important to emphasize that the initial positions and orientations of both the UAV swarm and the target are randomly generated, and the agents' safe distance $l_{\mathrm{int}}$ is set to 20 m to mitigate the risk of collisions.
As the verification results in Figure 6 show, four typical tracking and encirclement scenarios with four or five trackers are presented after training. On the trackers' side, the main objective is to proactively converge toward the target and encircle it in a coordinated manner. The punishment incurred by the safe-interval reward has incentivized the UAVs in Figure 6c to learn to encircle the target instead of getting too close to it, resulting in higher rewards, which demonstrates the practicality and intelligence of the algorithm. On the target's side, it exhibits an evasion tendency while the trackers close in across all presented scenarios, conforming to a rational response in a real pursuit–evasion situation. Under the four different initialization conditions, all TTE tasks are completed within 100 steps (seconds). The UAVs selected flight speeds near the 10 m/s maximum in the early stages of tracking to quickly approach the target and then matched the target's speed after approaching.
More importantly, the TTE performance of TMSAC in unknown environments is revealed in Figure 6, which employs four types of obstacle distributions while four UAVs form the tracker group. In Figure 6e,f, several isolated geometric objects are deployed in the area, with the trackers and evader spawned in the remaining open space. On the other hand, narrow channels with different humps are set in the environments shown in Figure 6g,h. The results demonstrate that the agents effectively avoid collisions during both the pursuit and evasion phases.

5.2. Comparison with Other Methods

To demonstrate the effectiveness and efficiency of the proposed algorithm, we also implement two baseline methods, MADDPG and Multi-Agent Soft Actor–Critic (MASAC), with nearly identical hyper-parameters in the same environment. In particular, MASAC shares the same algorithm framework as MADDPG but replaces the loss function according to maximum-entropy learning theory. Both baselines adopt fully connected MLPs as their actors and critics, which have structures similar to TMSAC but do not include any GRU layers. Different numbers of UAVs are deployed in the simulation as trackers, and the training rewards are shown in Figure 7, where the UAV rewards are averaged over all swarm members. From the reward curves, several learning characteristics of the algorithms can be summarized as follows:
  • Higher reward values. In all comparison cases, TMSAC obtains 5–20% more reward than the other methods under the same environment and hyper-parameter settings, for both the trackers and the target simultaneously. This means that the strategies approximated by TMSAC are closer to the MPE discussed in Section 3.
  • Adaptability to scenarios with more agents. For fully distributed algorithms, increasing the number of agents makes it more challenging to search for NE strategies. From Figure 7a–e, the advantage of TMSAC on the trackers' side becomes progressively more obvious. The recurrent network structure gives each agent a measure of prediction ability to avoid potential collisions and to select effective actions for predictive tracking, while the growing agent group brings more kinematic information into the state vectors. The results demonstrate that as the number of trackers in the environment increases, TMSAC exhibits enhanced robustness in identifying near-Nash Equilibrium policies.
After evaluating the training efficiency, we analyze the success rate and the number of steps required for the successful tracking process under different conditions. To evaluate the algorithms, we define two performance metrics:
  • Full Success Rate (FSR): This metric measures the percentage of scenarios in which all trackers in the MUAVS successfully capture their targets within a limited number of time steps.
  • Half Success Rate (HSR): This metric quantifies the percentage of scenarios, in which at least half of the trackers achieve their goals.
The criterion for tracking success is specified as $d_{t_i} \in l_{\mathrm{int}} \pm 5\,\mathrm{m}$, and the time-step usage represents the percentage of tracking-success steps over the total steps $T$. As shown in Figure 8a, with random obstacles, our algorithm achieves higher scores averaged over 100 test episodes in both the FSR and HSR metrics. Meanwhile, as can be seen in Figure 8b, TMSAC holds a slight advantage in time-step usage.
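For reference, the two metrics under the stated capture criterion can be computed as in the sketch below; the episode data layout is an assumption.

```python
# Illustrative FSR/HSR computation; per-episode distance lists are assumed inputs.
import numpy as np

def episode_success_flags(final_distances, l_int=20.0, tol=5.0):
    """One boolean per tracker: final distance within l_int +/- tol."""
    d = np.asarray(final_distances)
    return np.abs(d - l_int) <= tol

def success_rates(all_episodes):
    """all_episodes: list of per-tracker final distances, one entry per test episode."""
    flags = [episode_success_flags(d) for d in all_episodes]
    fsr = np.mean([f.all() for f in flags])           # every tracker captured the target
    hsr = np.mean([f.mean() >= 0.5 for f in flags])   # at least half of the trackers did
    return fsr, hsr
```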
Furthermore, we evaluate the performance of the different algorithms as the number of UAVs increases. As shown in Figure 9, we gradually increase the number of deployed UAVs in an environment with random obstacles. The TMSAC algorithm demonstrates superior scalability compared with the other two algorithms, maintaining a stable task success rate as the number of game participants grows. Specifically, when the number of UAVs is less than 10, the success rate of TTE tasks executed by navigators trained with TMSAC and MASAC is not significantly affected. When the number of UAVs exceeds 10, the environmental complexity begins to significantly impact the task success rates of all three algorithms. When the number of UAVs reaches 20, the TMSAC success rate drops by 8% compared with the 5-UAV condition, MASAC drops by 10%, and MADDPG drops by nearly 20%. It should be noted that all three algorithms use the centralized training and distributed execution architecture, and increasing the number of agents does not affect the algorithms' convergence; however, under the same training conditions, the algorithms show different levels of performance.

5.3. Visual Simulation

Based on the above work, we build a comprehensive dynamic simulation platform derived from gym-pybullet-drones [47], a multifunctional end-to-end training environment for UAV intelligent control methods that uses the PyBullet physics engine as its dynamics solver. The trained actors function as navigators in the platform, complemented by Proportional–Integral–Derivative (PID) attitude and velocity controllers to establish a closed-loop negative feedback control system. For the real-time dynamic simulation, Table 3 describes the physical parameters of the agents, both for the MUAVS and the target.
The system block diagram is shown in Figure 10, where the SPG-based navigator receives the embedded kinematics and environment observation and outputs the expected plane velocity $v_{\mathrm{ref}}^i$ and yaw rate $\dot{\psi}_{\mathrm{ref}}^i$. The classic PID velocity and attitude controllers calculate the propeller forces, simplifying the expression without motor rotation speeds. Then, the UAV dynamic model introduced in Section 2 updates the three-axis velocities and angular velocities for the SPG relative kinematics. Within this framework, the visual simulation displayed in Figure 11 is executed at an iteration frequency of 120 Hz and a control frequency of 60 Hz to evaluate the performance of the actors under real-time conditions.
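A minimal sketch of this closed loop is given below: the trained actor acts as the navigator producing $(v_{\mathrm{ref}}, \dot{\psi}_{\mathrm{ref}})$, and a PID layer converts the reference into commands for the dynamics update. The `PID` gains, the `Quadrotor`-like interface, and the `navigate` method are placeholders.

```python
# Illustrative navigator-plus-PID loop in the spirit of Figure 10 (assumed interfaces).
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.i, self.prev = 0.0, 0.0

    def step(self, err):
        self.i += err * self.dt
        d = (err - self.prev) / self.dt
        self.prev = err
        return self.kp * err + self.ki * self.i + self.kd * d

def control_step(actor, quad, state, dt=1.0 / 60.0):
    v_ref, psi_dot_ref = actor.navigate(state)      # SPG-based navigator output
    v_err = v_ref - quad.ground_speed()
    yaw_err = psi_dot_ref - quad.yaw_rate()
    thrust_cmd = quad.vel_pid.step(v_err)           # outer velocity loop
    yaw_cmd = quad.yaw_pid.step(yaw_err)            # yaw-rate loop
    quad.apply(thrust_cmd, yaw_cmd, dt)             # inner attitude loop and dynamics update
```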
In one iteration of the virtual experiment, the dynamic responses of all agents' positions and attitudes on all three axes are recorded and presented in Figure 12. The initial conditions are exactly the same as those of the above simulations. It can be seen that both the X and Y positions converge with small biases, with velocities and angular velocities controlled within reasonable ranges. Note that the yaw angle is defined in $[-\pi, \pi]$; therefore, some apparent jumps in the yaw angle records are due to angle wrap-around.

6. Conclusions

To address the unknown environment and target evasion problems, we proposed TMSAC, a multi-agent reinforcement learning algorithm that approximates agent strategies converging to the MPE of the SPG abstracted from TTE scenarios. By integrating a reward function design compliant with the SPG paradigm and a state-action space design meeting practical deployment requirements, the proposed algorithm demonstrates exceptional performance for MUAVS executing TTE tasks, while also showing potential for scalability and large-scale applications. Furthermore, a series of simulations, including performance tests and virtual dynamic experiments, was conducted to validate the characteristics of the proposed algorithm. The simulation results demonstrate that even under unknown target evasion and environmental disturbances, the MUAVS can still effectively track the target and achieve encirclement flight. A comparison of training rewards showed that TMSAC outperformed the two baseline algorithms in both training rewards and tracking success rate. Additionally, the dynamic experiments provided evidence of the real-time deployability of the TMSAC method.
In terms of future work, the key issue is to extend the approach to the N-vs.-M TTE problem by incorporating task allocation and target-selection mechanisms. This will broaden the scope of the approach and enable it to operate in more complex environments. By incorporating hierarchical learning techniques, the model and algorithm proposed in this paper show potential for further expansion and research.

Author Contributions

Conceptualization, K.Y. and M.Z.; methodology, K.Y. and X.G.; software, K.Y., Y.Z. (Yifei Zhang) and Y.Z. (Yuting Zhou); validation, K.Y. and X.G.; formal analysis, M.Z.; investigation, K.Y. and X.G.; writing—original draft preparation, K.Y.; writing—review and editing, M.Z., X.G. and Y.Z. (Yifei Zhang); visualization, K.Y.; supervision, M.Z.; project administration, M.Z.; funding acquisition, Y.Z. (Yifei Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fundamental Research Funds for the Central Universities (Grant No. 501JCGG2024129007).

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TTE	Target Tracking and Encirclement
MUAVS	Multiple Unmanned Aerial Vehicles System
AC	Actor–Critic
MARL	Multi-Agent Reinforcement Learning
PG	Potential Game
NE	Nash Equilibrium
SPG	Stochastic Potential Game
SG	Stochastic Game
MDP	Markov Decision Process
MPE	Markov Perfect Equilibrium
TMSAC	Time-Series Multi-Agent Soft Actor–Critic
MASAC	Multi-Agent Soft Actor–Critic
MADDPG	Multi-Agent Deep Deterministic Policy Gradient
GRU	Gated Recurrent Unit
MLPs	Multilayer Perceptrons

References

  1. Erdelj, M.; Król, M.; Natalizio, E. Wireless sensor networks and multi-UAV systems for natural disaster management. Comput. Netw. 2017, 124, 72–86. [Google Scholar] [CrossRef]
  2. Outay, F.; Mengash, H.A.; Adnan, M. Applications of unmanned aerial vehicle (UAV) in road safety, traffic and highway infrastructure management: Recent advances and challenges. Transp. Res. Part A Policy Pract. 2020, 141, 116–129. [Google Scholar] [CrossRef] [PubMed]
  3. Wang, C.; Wang, J.; Shen, Y.; Zhang, X. Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach. IEEE Trans. Veh. Technol. 2019, 68, 2124–2136. [Google Scholar] [CrossRef]
  4. Chong, C.Y.; Garren, D.; Grayson, T.P. Ground target tracking-a historical perspective. In Proceedings of the 2000 IEEE Aerospace Conference. Proceedings (Cat. No. 00TH8484), Big Sky, MT, USA, 25 March 2000; Volume 3, pp. 433–448. [Google Scholar]
  5. Luo, C.; McClean, S.I.; Parr, G.; Teacy, L.; De Nardi, R. UAV position estimation and collision avoidance using the extended Kalman filter. IEEE Trans. Veh. Technol. 2013, 62, 2749–2762. [Google Scholar] [CrossRef]
  6. Mao, G.; Drake, S.; Anderson, B.D. Design of an extended kalman filter for uav localization. In Proceedings of the 2007 Information, Decision and Control, Adelaide, SA, Australia, 12–14 February 2007; pp. 224–229. [Google Scholar]
  7. Rullán-Lara, J.L.; Salazar, S.; Lozano, R. Real-time localization of an UAV using Kalman filter and a Wireless Sensor Network. J. Intell. Robot. Syst. 2012, 65, 283–293. [Google Scholar] [CrossRef]
  8. Xiong, J.J.; Zheng, E.H. Optimal kalman filter for state estimation of a quadrotor UAV. Optik 2015, 126, 2862–2868. [Google Scholar] [CrossRef]
  9. Leven, W.F.; Lanterman, A.D. Unscented Kalman filters for multiple target tracking with symmetric measurement equations. IEEE Trans. Autom. Control 2009, 54, 370–375. [Google Scholar] [CrossRef]
  10. Kamen, E. Multiple target tracking based on symmetric measurement equations. In Proceedings of the 1989 American Control Conference, Pittsburgh, PA, USA, 21–23 June 1989; pp. 2690–2695. [Google Scholar]
  11. Gulati, D.; Zhang, F.; Clarke, D.; Knoll, A. Graph-based cooperative localization using symmetric measurement equations. Sensors 2017, 17, 1422. [Google Scholar] [CrossRef]
  12. Quintero, S.A.; Copp, D.A.; Hespanha, J.P. Robust UAV coordination for target tracking using output-feedback model predictive control with moving horizon estimation. In Proceedings of the 2015 American Control Conference (ACC), Chicago, IL, USA, 1–3 July 2015; pp. 3758–3764. [Google Scholar]
  13. Quintero, S.A.; Copp, D.A.; Hespanha, J.P. Robust Coordination of Small UAVs for Vision-Based Target Tracking Using Output-Feedback MPC with MHE. In Cooperative Control of Multi-Agent Systems: Theory and Applications; Wiley: Hoboken, NJ, USA, 2017; pp. 51–83. [Google Scholar]
  14. Shen, C.; Shi, Y.; Buckham, B. Path-following control of an AUV: A multiobjective model predictive control approach. IEEE Trans. Control Syst. Technol. 2018, 27, 1334–1342. [Google Scholar] [CrossRef]
  15. Gao, Y.; Bai, C.; Zhang, L.; Quan, Q. Multi-UAV cooperative target encirclement within an annular virtual tube. Aerosp. Sci. Technol. 2022, 128, 107800. [Google Scholar] [CrossRef]
  16. Li, K.; Han, Y.; Yan, X. Distributed multi-UAV cooperation for dynamic target tracking optimized by an SAQPSO algorithm. ISA Trans. 2022, 129, 230–242. [Google Scholar]
  17. Xie, R.; Dempster, A.G. An on-line deep learning framework for low-thrust trajectory optimisation. Aerosp. Sci. Technol. 2021, 118, 107002. [Google Scholar] [CrossRef]
  18. Xia, Z.; Du, J.; Wang, J.; Jiang, C.; Ren, Y.; Li, G.; Han, Z. Multi-agent reinforcement learning aided intelligent UAV swarm for target tracking. IEEE Trans. Veh. Technol. 2021, 71, 931–945. [Google Scholar] [CrossRef]
  19. Wenhong, Z.; Jie, L.; Zhihong, L.; Lincheng, S. Improving multi-target cooperative tracking guidance for UAV swarms using multi-agent reinforcement learning. Chin. J. Aeronaut. 2022, 35, 100–112. [Google Scholar]
  20. Ao, T.; Zhang, K.; Shi, H.; Jin, Z.; Zhou, Y.; Liu, F. Energy-Efficient Multi-UAVs Cooperative Trajectory Optimization for Communication Coverage: An MADRL Approach. Remote. Sens. 2023, 15, 429. [Google Scholar] [CrossRef]
  21. Zheng, Z.; Cai, S. A collaborative target tracking algorithm for multiple UAVs with inferior tracking capabilities. Front. Inf. Technol. Electron. Eng. 2021, 22, 1334–1350. [Google Scholar] [CrossRef]
  22. Yang, Y.; Wang, J. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv 2020, arXiv:2011.00583. [Google Scholar]
  23. Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Pérolat, J.; Silver, D.; Graepel, T. A unified game-theoretic approach to multiagent reinforcement learning. Adv. Neural Inf. Process. Syst. 2017, 30, 4193–4206. [Google Scholar]
  24. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. Adv. Neural Inf. Process. Syst. 2017, 30, 6382–6393. [Google Scholar]
  25. Vamvoudakis, K.G.; Modares, H.; Kiumarsi, B.; Lewis, F.L. Game theory-based control system algorithms with real-time reinforcement learning: How to solve multiplayer games online. IEEE Control Syst. Mag. 2017, 37, 33–52. [Google Scholar]
  26. Jiang, J.; Dun, C.; Huang, T.; Lu, Z. Graph convolutional reinforcement learning. arXiv 2018, arXiv:1810.09202. [Google Scholar]
  27. Wang, J.; Ye, D.; Lu, Z. More centralized training, still decentralized execution: Multi-agent conditional policy factorization. arXiv 2022, arXiv:2209.12681. [Google Scholar]
28. Meng, X.; Tan, Y. PMAC: Personalized Multi-Agent Communication. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 17505–17513. [Google Scholar]
  29. Hu, J.; Wellman, M.P. Nash Q-learning for general-sum stochastic games. J. Mach. Learn. Res. 2003, 4, 1039–1069. [Google Scholar]
  30. Vamvoudakis, K.G. Non-zero sum Nash Q-learning for unknown deterministic continuous-time linear systems. Automatica 2015, 61, 274–281. [Google Scholar] [CrossRef]
  31. Holt, C.A.; Roth, A.E. The Nash equilibrium: A perspective. Proc. Natl. Acad. Sci. USA 2004, 101, 3999–4002. [Google Scholar] [CrossRef]
  32. Perkins, S.; Mertikopoulos, P.; Leslie, D.S. Mixed-strategy learning with continuous action sets. IEEE Trans. Autom. Control 2015, 62, 379–384. [Google Scholar] [CrossRef]
  33. Margellos, K.; Lygeros, J. Hamilton–Jacobi formulation for reach–avoid differential games. IEEE Trans. Autom. Control 2011, 56, 1849–1861. [Google Scholar] [CrossRef]
  34. Kokolakis, N.M.T.; Kanellopoulos, A.; Vamvoudakis, K.G. Bounded rational unmanned aerial vehicle coordination for adversarial target tracking. In Proceedings of the 2020 American Control Conference (ACC), Denver, CO, USA, 1–3 July 2020; pp. 2508–2513. [Google Scholar]
  35. Monderer, D.; Shapley, L.S. Potential games. Games Econ. Behav. 1996, 14, 124–143. [Google Scholar] [CrossRef]
  36. Deng, X.; Li, N.; Mguni, D.; Wang, J.; Yang, Y. On the complexity of computing markov perfect equilibrium in general-sum stochastic games. Natl. Sci. Rev. 2023, 10, nwac256. [Google Scholar] [CrossRef]
  37. Babichenko, Y.; Rubinstein, A. Communication complexity of Nash equilibrium in potential games. In Proceedings of the 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), Durham, NC, USA, 16–19 November 2020; pp. 1439–1445. [Google Scholar]
  38. Mguni, D.H.; Wu, Y.; Du, Y.; Yang, Y.; Wang, Z.; Li, M.; Wen, Y.; Jennings, J.; Wang, J. Learning in nonzero-sum stochastic games with potentials. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 7688–7699. [Google Scholar]
  39. Yu, Y.; Wang, H.; Liu, S.; Guo, L.; Yeoh, P.L.; Vucetic, B.; Li, Y. Distributed multi-agent target tracking: A Nash-combined adaptive differential evolution method for UAV systems. IEEE Trans. Veh. Technol. 2021, 70, 8122–8133. [Google Scholar] [CrossRef]
  40. Zhang, K.; Yang, Z.; Başar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Handbook of Reinforcement Learning and Control; Springer: Berlin/Heidelberg, Germany, 2021; pp. 321–384. [Google Scholar]
  41. Beard, R.W. Quadrotor Dynamics and Control; Brigham Young University: Provo, UT, USA, 2008; Volume 19, pp. 46–56. [Google Scholar]
  42. Luukkonen, T. Modelling and Control of Quadcopter; Independent Research Project in Applied Mathematics, Aalto University: Espoo, Finland, 2011; Volume 22. [Google Scholar]
  43. Chen, X.; Deng, X.; Teng, S.H. Settling the complexity of computing two-player Nash equilibria. J. ACM 2009, 56, 1–57. [Google Scholar] [CrossRef]
  44. Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
45. Peng, B.; Rashid, T.; Schroeder de Witt, C.; Kamienny, P.A.; Torr, P.; Böhmer, W.; Whiteson, S. FACMAC: Factored multi-agent centralised policy gradients. Adv. Neural Inf. Process. Syst. 2021, 34, 12208–12221. [Google Scholar]
  46. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  47. Panerati, J.; Zheng, H.; Zhou, S.; Xu, J.; Prorok, A.; Schoellig, A.P. Learning to fly—A gym environment with pybullet physics for reinforcement learning of multi-agent quadcopter control. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 7512–7519. [Google Scholar]
Figure 1. Dynamic model of the quadrotor. Four motors drive propellers that generate the lift forces F_1–F_4; they are mounted symmetrically about the O_bX_b and O_bY_b axes, at distances l and b from them, respectively. M_x, M_y, and M_z are the moments produced by the lift forces about the aforementioned BRF axes.
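To relate the figure to a force–moment mapping, the following is a minimal sketch only; the motor numbering, the arm layout with offsets l and b, and the drag-to-thrust coefficient k_m are illustrative assumptions, not the paper's exact convention.

```python
import numpy as np

def forces_to_wrench(F, l=0.2, b=0.2, k_m=0.016):
    """Map four propeller lift forces to total thrust and body moments.

    Assumed layout (illustrative only): motors 1 and 3 sit on the O_b X_b axis
    at distance l, motors 2 and 4 on the O_b Y_b axis at distance b; k_m is an
    assumed drag-to-thrust ratio producing the yaw moment.
    """
    F1, F2, F3, F4 = F
    thrust = F1 + F2 + F3 + F4          # total lift along O_b Z_b
    Mx = b * (F2 - F4)                  # roll moment from the Y_b-axis motor pair
    My = l * (F3 - F1)                  # pitch moment from the X_b-axis motor pair
    Mz = k_m * (F1 - F2 + F3 - F4)      # yaw moment from propeller drag torques
    return thrust, np.array([Mx, My, Mz])

# Hover check with the mass listed in Table 3 (0.25 kg): each motor carries m*g/4.
print(forces_to_wrench([0.613] * 4))
```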
Figure 2. The SPG environment. Agents in the MUAVS share relative kinematic information κ_uu while avoiding potential collisions. The target interacts with the MUAVS, and the relative position between each UAV and the target is represented by κ_ut.
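As a small illustration of the relative quantities named in the caption, the sketch below assembles κ_uu (UAV-to-UAV relative position and velocity) and κ_ut (target position relative to each UAV) from planar states; the [x, y, vx, vy] state layout is an assumption made here for illustration.

```python
import numpy as np

def relative_kinematics(states, target_pos):
    """states: (N, 4) array of [x, y, vx, vy] per UAV; target_pos: (2,) array."""
    pos, vel = states[:, :2], states[:, 2:]
    # kappa_uu: relative position and velocity for every ordered UAV pair (i, j)
    kappa_uu = {
        (i, j): np.concatenate([pos[j] - pos[i], vel[j] - vel[i]])
        for i in range(len(states)) for j in range(len(states)) if i != j
    }
    # kappa_ut: target position relative to each UAV, one row per UAV
    kappa_ut = target_pos[None, :] - pos
    return kappa_uu, kappa_ut

states = np.array([[0.0, 0.0, 1.0, 0.0], [10.0, 0.0, 0.0, 1.0]])
kuu, kut = relative_kinematics(states, np.array([5.0, 5.0]))
print(kut)   # relative target position as seen by each UAV
```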
Figure 3. The obstacle detection and interval-keeping ranges of agents in the system. Limited by sensor performance, agents have an effective detection range l_det. For safety, each pair of agents has to keep d_i,j ≥ 2 l_int.
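A minimal check of the two geometric constraints described in the caption is sketched below; the function names and the planar position layout are illustrative assumptions, with the numeric defaults taken from Table 1.

```python
import numpy as np

def interval_ok(pos, l_int=20.0):
    """True if every pairwise UAV distance d_ij satisfies d_ij >= 2 * l_int."""
    diff = pos[:, None, :] - pos[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(d, np.inf)          # ignore self-distances
    return bool((d >= 2.0 * l_int).all())

def detected_obstacles(pos_i, obstacle_points, l_det=8.0):
    """Return the obstacle points lying within the effective detection range l_det."""
    d = np.linalg.norm(obstacle_points - pos_i, axis=-1)
    return obstacle_points[d <= l_det]
```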
Figure 4. Actor and critic structures in TMSAC.
Figure 5. The data flowchart. Note that centralized training with decentralized execution means that the critic can access global state information for each agent, while each actor makes decisions from its own state only.
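To make the centralized-training/decentralized-execution split in the caption concrete, here is a minimal sketch: the critic consumes the concatenated global state and joint action, while each actor sees only its own observation. The layer sizes, and the omission of the recurrent encoder used in the paper, are simplifications for illustration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized execution: input is the agent's own observation only."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, own_obs):
        return self.net(own_obs)

class CentralCritic(nn.Module):
    """Centralized training: input is the global state plus the joint action."""
    def __init__(self, global_state_dim, joint_act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_state_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, global_state, joint_action):
        return self.net(torch.cat([global_state, joint_action], dim=-1))
```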
Figure 6. Tracking and encirclement performance without obstacles (a–d) and with obstacles (e–h). The colored lines depict the flight trajectories of the agents, whereas the black lines indicate the encirclement boundaries of the UAV swarm. SP marks the start points and EP the end points. Subfigures (a,b) record the paths of four drones collaborating to track the target, and (c,d) those of five drones. Gray areas are abstractions of obstacles; subfigures (e–h) show the tracking performance of the drone swarms in four different obstacle environments.
Figure 7. Average UAV swarm rewards and target rewards under different conditions. (a) Average UAV rewards (3 UAVs); (b) target rewards (3 UAVs); (c) average UAV rewards (4 UAVs); (d) target rewards (4 UAVs); (e) average UAV rewards (5 UAVs); (f) target rewards (5 UAVs).
Figure 8. Tracking success rates and the number of tracking steps used by different algorithms. (a) Success rates. (b) Tracking times.
Figure 9. Tracking success rates under different numbers of UAVs.
Figure 10. Control system workflow in the visual simulation.
Figure 11. Screenshots from the visual simulation. Green marks are UAVs from the tracker team, and the red mark represents the target.
Figure 12. Kinematic records of all agents involved in an integrated simulation. The fluctuation in y (rad) is due to the coordinate-system setup: in the UAV body coordinate system the yaw angle is limited to ±0.5 rad, so when the UAV rotates clockwise beyond 0.5 rad the recorded value jumps directly to −0.5 rad, although the actual yaw angle does not change abruptly. The fluctuations in dy (rad/s) are due to the dynamic nature of the environment and the continuous adjustments made by the attitude controller to maintain the formation.
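The apparent jumps in y (rad) described in the caption are a wrap-around artifact of the recorded signal rather than a physical discontinuity. The sketch below shows one way to unwrap such a record before analysis, assuming the yaw value wraps at ±0.5 rad as stated; the function itself is not from the authors' code.

```python
import numpy as np

def unwrap_yaw(yaw, limit=0.5):
    """Undo wrap-around jumps in a yaw record confined to [-limit, limit].

    Whenever consecutive samples differ by more than `limit`, a jump of the
    full range (2 * limit) is assumed and compensated.
    """
    yaw = np.asarray(yaw, dtype=float)
    out = yaw.copy()
    offset = 0.0
    for k in range(1, len(yaw)):
        step = yaw[k] - yaw[k - 1]
        if step > limit:        # jumped from near -limit up to near +limit
            offset -= 2.0 * limit
        elif step < -limit:     # jumped from near +limit down to near -limit
            offset += 2.0 * limit
        out[k] = yaw[k] + offset
    return out

# Example: a rotation that crosses the +0.5 rad boundary and wraps to -0.5 rad.
print(unwrap_yaw([0.40, 0.48, -0.49, -0.45]))
```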
Table 1. Environment and agent parameters.
Names | Values
Plane size | 1000 m × 1000 m
Tracker max speed v_max | 10 m/s
Tracker max angular speed ψ̇_max | 15 deg/s
Target max speed v_max^t | 10 m/s
Target max angular speed ψ̇_max^t | 10 deg/s
Trackers number N_UAV | 4 or 5
Target number | 1
Agent safe interval l_int | 20 m
Obstacle detection distance l_obs | 8 m
Obstacle detection points N_obs | 12
Obstacles shape and size | random
Table 2. Training hyper-parameters.
Names | Values
Total episodes M | 2 × 10^4
Total steps in each episode T | 150
Switching period M_r | 20 × N
Learning interval N_g | 5
Batch size | 1024
Target entropy Ĥ | −2
Discount factor γ | 0.988
Buffer size | 2^17
Sequential length | 10
Optimizer | Adam
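For reproduction, the hyper-parameters in Table 2 could be collected in a single configuration object along the following lines; the values are copied from the table, but the field names are illustrative rather than taken from the authors' code.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Values from Table 2; field names are illustrative.
    total_episodes: int = 20_000          # M
    steps_per_episode: int = 150          # T
    switching_period_factor: int = 20     # M_r = 20 * N (N = number of agents)
    learning_interval: int = 5            # N_g
    batch_size: int = 1024
    target_entropy: float = -2.0          # H hat
    gamma: float = 0.988                  # discount factor
    buffer_size: int = 2 ** 17
    sequence_length: int = 10             # length of the time-serial input
    optimizer: str = "Adam"

cfg = TrainConfig()
print(cfg.buffer_size)   # 131072
```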
Table 3. Physical parameters.
Names | Values
Gravity acceleration g | 9.81 m/s²
UAV mass m | 0.25 kg
UAV arm lengths l and b | 0.2 m
X-axis inertia J_x | 1.4 × 10^−4 kg·m²
Y-axis inertia J_y | 1.4 × 10^−4 kg·m²
Z-axis inertia J_z | 2.2 × 10^−4 kg·m²
Max propeller force | 7.23 N
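As a quick consistency check on Table 3, the hover thrust per motor implied by the listed mass is far below the maximum propeller force; the short sketch below verifies this. The calculation is standard, and only the variable names are introduced here.

```python
# Hover-thrust sanity check using the values listed in Table 3.
g = 9.81          # m/s^2
mass = 0.25       # kg
f_max = 7.23      # N, maximum propeller force

hover_per_motor = mass * g / 4.0                         # thrust each of the four motors must supply
print(f"{hover_per_motor:.3f} N per motor")              # about 0.613 N
print(f"thrust margin: {f_max / hover_per_motor:.1f}x")  # roughly 11.8x headroom
```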
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
