CN117579358B - Multi-agent communication method, device, storage medium and electronic equipment - Google Patents
- Publication number
- CN117579358B CN202311586285.7A CN202311586285A
- Authority
- CN
- China
- Prior art keywords
- agent
- information
- state
- communication
- action
- Prior art date
- Legal status
- Active
Classifications
- H04L63/205 — Network architectures or protocols for network security; managing network security policies in general, involving negotiation or determination of the network security mechanisms to be used, e.g. by negotiation between client and server or between peers (H04L — transmission of digital information)
- G06N3/092 — Reinforcement learning (G06N3 — computing arrangements based on biological models; G06N3/02 — neural networks; G06N3/08 — learning methods)
- H04L63/0218 — Distributed architectures, e.g. distributed firewalls (H04L63/02 — separating internal from external traffic, e.g. firewalls; H04L63/0209 — architectural arrangements such as perimeter networks or demilitarized zones)
- H04L9/40 — Network security protocols (H04L9 — cryptographic mechanisms or arrangements for secret or secure communications)
- Y02D30/70 — Reducing energy consumption in wireless communication networks (Y02D — climate change mitigation technologies in information and communication technologies)
Abstract
The invention relates to a multi-agent communication method, a device, a storage medium and electronic equipment. The method comprises: constructing a distributed networked multi-agent learning system based on the communication relationships among agents, with the agents serving as task-execution nodes and the communication relationships described as edges; each agent executes a local decision action based on the observed current global environment state and its own neural network, and obtains a state-action reward value and an updated global environment state; random noise information is obtained by sampling from a Laplace function distribution; the state-action cost function estimation information is combined with the random noise information to generate privacy-preserving communication information, which is exchanged over bidirectional communication channels established with the agent's neighbor agents; and the neural network is iteratively updated according to the current state-action cost function estimation information, the received privacy-preserving communication information, the environment-feedback reward value and the new global environment state. This improves communication security with rigorous theoretical guarantees.
Description
Technical Field
The present invention relates to the field of communication technologies, and in particular to a multi-agent communication method, a device, a storage medium, and electronic equipment.
Background
In recent years, with the rapid development of communication technology and artificial intelligence, many real-life systems can be modeled as multi-agent systems (MAS), such as sensor networks, networked autonomous-driving vehicles, smart grids, and unmanned warehouse systems. To enhance the autonomous decision-making and collaborative capabilities of agents, multi-agent reinforcement learning (MARL) provides an efficient framework and training paradigm for these scenarios.
To address the non-stationarity caused by simultaneous multi-agent communication, decision-making and learning, MARL frameworks mainly adopt Centralized Training with Decentralized Execution (CTDE) algorithms. A CTDE algorithm generally assumes that a strong central node exists during training: it collects the local observations and individual actions of every agent, learns the joint optimal policy of all agents based on the environment state and the reward function corresponding to the joint actions, and then distributes that joint policy to each agent. In the execution phase, each agent makes decisions based only on its own local observations. Benefiting from the centralized information architecture of CTDE, a series of representative methods have been developed, mainly covering credit assignment, communication learning and policy decomposition. The agents can fully exploit the interaction information of the centralized training stage to better understand the environment and the behaviors of other agents, greatly alleviating the non-stationary training problem and helping agents choose actions that benefit the team. Representative algorithms include Value Decomposition Networks (VDN), QMIX, Multi-Agent Deep Deterministic Policy Gradient (MADDPG), Differentiable Inter-Agent Learning (DIAL), BiCNet, and the like.
However, CTDE algorithms cannot handle the exponentially growing state-action space, i.e., the curse of dimensionality in the central controller. In addition, during training, the massive information exchange between the central controller and the agents places enormous pressure on communication, and the centralized architecture also increases the systemic risk of single-point failures. A way to relax the limitations of CTDE algorithms is therefore to exploit the distributed architecture of the networked system and develop Decentralized Training and Decentralized Execution (DTDE) MARL algorithms. During training of a DTDE algorithm, the information available to an agent is limited to the local neighbor agents within its communication range, instead of all agents as in a CTDE algorithm, which avoids potential information leakage and overfitting to other agents' information. During execution, the use of neighbor information forces agents to pay more attention to policy coordination with each other, rather than merely making decisions based on their own local observations.
In DTDE algorithms, using local information diffused over the communication network can improve the deployability, flexibility, robustness and resilience of MARL, but it faces a distinctive problem: the reliability of neighbor information under networked communication. In the related art, most MARL methods under the DTDE framework assume that the communication channels and the members of the team are sufficiently safe and reliable, and ignore the catastrophic damage that network attacks and malicious behavior can cause to the security of the multi-agent reinforcement learning system; as a result, the communication security among agents in such multi-agent reinforcement learning methods is not high.
Disclosure of Invention
In view of this, the present invention provides a multi-agent communication method, apparatus, storage medium, and electronic device.
Specifically, the invention is realized by the following technical scheme:
According to a first aspect of the present invention, there is provided a multi-agent communication method comprising:
Constructing a distributed networked multi-agent learning system based on a communication relationship among agents, wherein the agents are execution unit nodes of the distributed networked multi-agent learning system, and the communication relationship is edges of the distributed networked multi-agent learning system;
Generating a local decision action by utilizing a Q-Learning algorithm based on a current global environment state observed by a target intelligent agent in a current decision period and state-action cost function estimation information expressed by a neural network, executing the local decision action in the current global environment state, and obtaining a reward value of the state-action cost function estimation information and a new global environment state from the environment;
Sampling based on Laplace function distribution according to preset differential privacy parameters to obtain random noise information;
generating privacy-preserving communication information based on the random noise information and the state-action cost function estimation information, and transmitting the privacy-preserving communication information to an agent having an edge with the target agent;
And receiving privacy protection communication receiving information sent by an agent with a side with the target agent, and updating the neural network according to the decision action taken by the target agent, the state-action cost function estimation information, the privacy protection communication receiving information, the reward value, the new global environment state and the current global environment state.
According to the multi-agent communication method in the above technical scheme, a distributed networked multi-agent learning system is built based on the communication relationships among agents, with agents as nodes and communication relationships as edges; a local decision action is obtained based on the current global environment state observed by the target agent and its neural network, the local decision action is executed, and a reward value for the state-action cost function estimation information and a new global environment state are obtained from the environment; random noise information is obtained by sampling from a Laplace function distribution according to preset differential privacy parameters; privacy-preserving communication information is generated based on the random noise information and the state-action cost function estimation information and sent to the agents that share an edge with the target agent; and the neural network is updated based on the state-action cost function estimation information, the received privacy-preserving communication information, the reward value, the new global environment state and the current global environment state. In this way, the sampled random noise is added to the state-action cost function estimation information, so that the estimation information used for communication cannot be restored, which improves communication security. At the same time, through the neural network update, convergence and privacy protection of the noisy state-action cost function estimation information can be guaranteed even when the received information is perturbed, achieving high-quality policy coordination and cooperative communication among multiple agents.
According to a second aspect of the present invention, there is provided a multi-agent communication device comprising:
The system construction module is used for constructing a distributed networked multi-agent learning system based on a communication relation among agents, wherein the agents are execution unit nodes of the distributed networked multi-agent learning system, and the communication relation is an edge of the distributed networked multi-agent learning system;
The state action module is used for generating a local decision action by utilizing a Q-Learning algorithm based on the current global environment state observed by a target intelligent agent in the current decision period and state-action cost function estimation information expressed by a neural network, executing the local decision action in the current global environment state, and obtaining a rewarding value of the state-action cost function estimation information and a new global environment state from the environment;
The noise acquisition module is used for sampling based on Laplace function distribution according to preset differential privacy parameters to acquire random noise information;
The privacy protection module is used for generating privacy protection communication information based on the random noise information and the state-action cost function estimation information and sending the privacy protection communication information to an agent with an edge with the target agent;
And the policy updating module is used for receiving privacy protection communication receiving information sent by an agent with an edge with the target agent, and updating the neural network according to the decision action taken by the target agent, the state-action cost function estimation information, the privacy protection communication receiving information, the reward value, the new global environment state and the current global environment state.
According to a third aspect of the present invention there is provided a storage medium having stored thereon a computer program which when executed by a processor implements the steps of the multi-agent communication method in any possible implementation of the first aspect.
According to a fourth aspect of the present invention there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the multi-agent communication method in any possible implementation of the first aspect when the program is executed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the related art will be briefly described below, and it will be apparent to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a multi-agent communication method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a distributed network multi-agent learning system in a multi-agent communication method according to an embodiment of the present invention;
Fig. 3 is a schematic performance diagram of a multi-agent communication method in a single-lane deceleration following scene according to an embodiment of the present invention;
fig. 4 is a schematic performance diagram of a multi-agent communication method in a single-lane acceleration following scene according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a multi-agent communication processing device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the related art, most MARL methods under the DTDE framework assume that the communication channels and the members of the team are sufficiently safe and reliable, and ignore the catastrophic damage that network attacks and malicious behavior can cause to the security of the multi-agent reinforcement learning system; as a result, the communication security among agents in such multi-agent reinforcement learning methods is not high.
In this embodiment, a multi-agent communication method based on differential privacy (DP) protection is provided, which can be applied to a networked multi-agent reinforcement learning method or a networked multi-agent reinforcement learning system. Differential privacy comes from the fields of network security and machine learning: by adding uncorrelated random noise to the communication information in a communication channel, the communication information cannot be restored to the real communication information by a third party even if it is maliciously intercepted, without affecting the operation of the multi-agent reinforcement learning system, thereby improving the security and user privacy of the networked multi-agent reinforcement learning system. Accordingly, this embodiment uses a differential privacy protection mechanism: additive noise that decays over time is designed, random noise is sampled from a time-varying Laplace distribution, and the sampled random noise is added to the source communication information so that the source communication information cannot be restored. Meanwhile, by designing a corresponding agent policy update mechanism, this embodiment can guarantee the convergence and privacy protection of the noisy communication information even when the received information is perturbed, achieving high-quality policy coordination and cooperation among multiple agents.
Referring to fig. 1, an embodiment of the present invention provides a multi-agent distributed reinforcement learning communication method of a networked system, which may include the following steps:
S101, constructing a distributed networked multi-agent learning system based on a communication relation among agents, wherein the agents are execution unit nodes of the distributed networked multi-agent learning system, and the communication relation is an edge of the distributed networked multi-agent learning system;
In this embodiment, for a given application scenario, a state-action cost function is constructed for each of the N agents in the scenario, and the communication topology among the agents is set as a randomly switching, jointly connected undirected graph G = {V, E, A}, where V denotes the nodes, i.e., the agents, E denotes the communication edges between nodes, and A denotes the adjacency matrix.
In this embodiment, the adjacency matrix element between communicable nodes is 1, and the adjacency matrix element between non-communicable nodes is 0. For example, for a given agent (node), the corresponding adjacency matrix element is 1 for any agent that shares an edge with it, and 0 for any agent that does not. In this way, the curse of dimensionality and the communication pressure of centralized-training distributed-execution algorithms can be relieved.
In this embodiment, as an alternative embodiment, a distributed networked multi-agent learning system with decentralized training and decentralized execution is constructed based on a networked multi-agent Markov decision process.
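The following is a minimal sketch (in Python) of how such a networked learning system could be represented as an undirected graph with an adjacency matrix; it is illustrative only, and the class and function names are assumptions rather than part of the claimed method.

import numpy as np

class NetworkedMAS:
    def __init__(self, n_agents, edges):
        # V: agents indexed 0..n_agents-1; E: undirected communication edges.
        self.n_agents = n_agents
        self.adjacency = np.zeros((n_agents, n_agents), dtype=int)
        for i, j in edges:
            # Adjacency element is 1 between communicable nodes, 0 otherwise.
            self.adjacency[i, j] = 1
            self.adjacency[j, i] = 1

    def neighbors(self, i):
        # Agents connected to agent i by an edge (its communication range).
        return np.flatnonzero(self.adjacency[i])

For example, a topology in which agent A1 communicates with A2, A3 and A4 (as in the system of Fig. 2) is built as NetworkedMAS(8, edges=[(0, 1), (0, 2), (0, 3), ...]), and neighbors(0) then returns the indices of A2, A3 and A4.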
S102, generating local decision actions by utilizing a Q-Learning algorithm based on the current global environment state observed by a target agent in a current decision period and state-action cost function estimation information expressed by a neural network, executing the local decision actions in the current global environment state, and obtaining rewarding values of the state-action cost function estimation information and new global environment states from the environment;
In this embodiment, executing the local decision action yields the reward function value fed back by the environment and the updated global environment state, and the agent maintains its local state-action cost function estimate, using the state-action cost function information Q(s, a) of the "state-action" pair as the real (pre-noise) communication information.
In this embodiment, at the initial communication, each agent can observe the global environment state s, generate an independent local decision action a_i according to the global environment state and its own neural network, execute a_i in the environment, and maintain its local state-action cost function estimation information, which is then sent to other agents for mutual communication and cooperation. Taking a connected cooperative autonomous-driving scenario as an example, each agent is an autonomous vehicle with vehicle-to-vehicle (V2V) communication capability; multiple such agents travel together on the road and can communicate with the other agents in real time, and the joint driving behavior of all vehicles affects the global environment state of the area, including but not limited to: the vehicle positions in each lane, the distances between leading and following vehicles, the speed and acceleration of each vehicle, the congestion level of each lane, the signal-light status of each lane, the driving route, and so on. The local decision actions include, but are not limited to, the inter-vehicle distance control coefficient and the vehicle speed gain coefficient (including acceleration, deceleration and constant speed). For the independent local decision actions, see the related technical literature and set them according to the needs of the actual scenario; a detailed description is omitted here.
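As an illustration of the action-generation step only, the following sketch shows epsilon-greedy action selection from a neural state-action cost function; the network architecture, hidden size and the exploration strategy (epsilon-greedy) are assumptions and are not prescribed by this embodiment.

import numpy as np
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # Q(s, a) estimates for every local action

def select_local_action(q_net, global_state, epsilon=0.1):
    # Explore with probability epsilon, otherwise act greedily on the
    # agent's own state-action cost function estimates.
    n_actions = q_net.net[-1].out_features
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(global_state, dtype=torch.float32))
    return int(q_values.argmax().item())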
S103, sampling is carried out based on Laplace function distribution according to preset differential privacy parameters, and random noise information is obtained;
In this embodiment, a differential privacy protection mechanism is set up to privacy-protect the communication information: sampling is performed based on the Laplace function distribution, and each agent samples random noise information from a time-varying Laplace distribution.
In this embodiment, in order to protect the communication information, an additive Laplace noise mechanism that decays over time is adopted; by obtaining time-decaying additive Laplace noise (the random noise information), the real communication information can be privacy-protected. At each communication, random noise information η_i(t) is sampled from the following Laplace distribution:
η_i(t) ~ Lap(0, ι_i(t))
where ι_i(t) is the scale parameter of the noise distribution, which determines the Laplace noise distribution; s_i and q_i are the differential privacy parameters, which determine the initial scale and the decay rate of the noise, respectively; and s_gain is a gain factor used to adjust the overall noise level. Each parameter is a configurable constant satisfying s_i, q_i ∈ (0, 1) and s_gain ≥ 0.
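As an illustrative sketch only: the exact expression for ι_i(t) is not reproduced above, so the decay law iota_i(t) = s_gain * s_i * q_i**t used below is an assumption inferred from the stated roles of s_i (initial scale), q_i (decay rate) and s_gain (gain factor).

import numpy as np

def sample_privacy_noise(t, s_i=0.01, q_i=0.99, s_gain=0.1, rng=None):
    # eta_i(t) ~ Lap(0, iota_i(t)) with an assumed time-decaying scale.
    rng = np.random.default_rng() if rng is None else rng
    iota_t = s_gain * s_i * (q_i ** t)  # assumed form of the time-varying scale
    return rng.laplace(loc=0.0, scale=iota_t)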
S104, generating privacy protection communication information based on the random noise information and the state-action cost function estimation information, and sending the privacy protection communication information to an agent with edges to the target agent;
In this embodiment, the random noise information is added to the real communication information to construct the privacy-preserving communication information, which is then exchanged with the neighbor agents within the communication range.
In this embodiment, as an alternative embodiment, the privacy-preserving communication information is generated by adding the sampled random noise η_i(t) to the state-action cost function estimation information; the resulting noisy estimate constitutes the privacy-preserving communication information.
In this embodiment, the target agent establishes bidirectional communication channels, in a broadcast manner, with the neighbor agents within its communication range at the current moment and exchanges the privacy-protected information. That is, the communication range determines which agents have a communication relationship with the target agent, i.e., which agents are connected to it by an edge. Communication is thus limited to the local neighbor agents within the communication range, which to a certain extent avoids interference from outdated information and potential overfitting to other agents' information.
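The following sketch illustrates this exchange step (masking each agent's estimate with its sampled noise and delivering it only to the agents that share an edge with it); the function and variable names are illustrative assumptions.

import numpy as np

def broadcast_private_estimates(q_estimates, noises, adjacency):
    # q_estimates[i]: agent i's state-action cost function estimate (scalar or array);
    # noises[i]: the sampled eta_i(t); adjacency: the system's adjacency matrix A.
    masked = [q + eta for q, eta in zip(q_estimates, noises)]  # privacy-preserving info
    received = {}
    for i in range(len(q_estimates)):
        # Agent i only receives the noisy estimates of its edge-connected neighbors.
        received[i] = [masked[j] for j in np.flatnonzero(adjacency[i])]
    return masked, received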
Fig. 2 is a schematic diagram of the distributed networked multi-agent learning system in the multi-agent communication method according to an embodiment of the present invention. The system includes eight nodes (agents), A1 to A8, connected by edges. For the target agent A1, the agents connected to it by an edge are A2, A3 and A4, and the area enclosed by agents A2, A3 and A4 (the area within the broken line) is the communication range of the target agent A1.
S105, receiving privacy protection communication receiving information sent by an agent with an edge with the target agent, and updating the neural network according to the current global environment state, the decision action taken by the target agent, the state-action cost function estimation information, the privacy protection communication receiving information, the environment feedback rewarding value and the new global environment state.
In this embodiment, the environment-feedback reward value is the reward obtained by executing the local decision action in the current global environment state, and the received privacy-preserving communication information is the privacy-preserving communication information sent by the agents that share an edge with the target agent. As an alternative embodiment, after each acquisition of the state-action cost function estimation information, the parameters of the agent's neural network are updated so that the policy is updated and the updated neural network is used to make the local decision action in the next decision period. In this way, the state-action cost function estimate is iteratively updated according to the current state-action cost function estimate, the received privacy-preserving communication information, the environment-feedback reward value and the new global environment state, and a stable distributed collaborative decision policy is finally obtained through training, improving communication security with rigorous theoretical guarantees.
In this embodiment, on the basis of the QD-Learning algorithm, random noise is applied to the communication information, i.e., the state-action cost function estimation information, and the original communication information is replaced by the privacy-preserving communication information, thereby improving the security of the networked multi-agent reinforcement learning system.
In this embodiment, the neural network is updated according to the communication information, the received privacy-preserving communication information, the reward value, the new global environment state and the current global environment state. By designing an agent update strategy adapted to the privacy protection mechanism, the neural network update of the agent and the asymptotic convergence of the state-action cost function estimation information are achieved even though only local information and noise-perturbed privacy-preserving communication information are available, and the (p, r)-accuracy of the algorithm and the ε-differential privacy of the data are effectively guaranteed. As an alternative embodiment, the neural network is updated with an update rule combining a consistency (consensus) term over the neighbors' privacy-preserving estimates and a temporal-difference term, whose parameters satisfy the following conditions:
where a, b, τ_1 ∈ (0.5, 1) and ε_1 > 0 is a positive constant; T_{s,a}(k) denotes the time at which the (k+1)-th sample of a given state-action pair (s, a) occurs in the overall sequence of random events; r_i is the reward value obtained by executing the local decision action, computed from the reward function; s′ is the new global environment state and s is the current global environment state; the privacy-preserving communication receiving information is that sent by the j-th agent sharing an edge with the target agent; the state-action cost function estimation information corresponding to the next decision period is what the target agent transmits, in the next decision period, to each agent sharing an edge with it, while the current state-action cost function estimation information is what the target agent currently sends; s_t is the current global environment state and a_t is the local decision action; A is the set of local decision actions obtained according to the new global environment state and a′ is a local decision action obtained according to the new global environment state; the state-action cost function estimation information at (s′, a′) is obtained from the new global environment state and that action; and γ is the forgetting (discount) factor, a constant coefficient.
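Since the patent's exact update expressions are given as formulas not reproduced above, the following tabular sketch shows only the standard QD-Learning form of such an update (a consensus term over the neighbors' noisy estimates plus a temporal-difference term) as an assumption of how the received privacy-preserving information enters the update; the constants alpha and beta below stand in for the time-varying step-size sequences whose schedules are not reproduced here.

def qd_learning_update(Q_i, s, a, r_i, s_next, neighbor_Q_noisy, action_space,
                       alpha=0.05, beta=0.01, gamma=0.95):
    # Q_i: agent i's tabular estimate, a dict keyed by (state, action).
    # neighbor_Q_noisy: privacy-preserving estimates received from edge-connected agents.
    consensus = sum(Q_i[(s, a)] - q_j for q_j in neighbor_Q_noisy)   # consistency term
    best_next = max(Q_i[(s_next, a2)] for a2 in action_space)        # max over a'
    td_error = r_i + gamma * best_next - Q_i[(s, a)]                 # temporal-difference term
    Q_i[(s, a)] = Q_i[(s, a)] - beta * consensus + alpha * td_error
    return Q_i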
In this embodiment, the neural network is trained in advance, end-to-end, under the Decentralized Training and Decentralized Execution (DTDE) framework, with the sum of the consistency loss and the temporal-difference loss taken as the final optimization target. As an alternative embodiment, the target agent updates its own policy neural network by exchanging privacy-preserving communication information with the neighbor agents within its communication range.
In this embodiment, in the next decision period, the target agent performs the step of obtaining the local decision action based on the new global environment state and the updated neural network, and so on until all decision periods are performed.
The method of the embodiment can ensure that the state-action cost function estimation information (communication information) of each agent meets the mean square consistency and the expected consistency under the condition that the real communication information is disturbed by noise, and is described in detail below.
In the present embodiment, provided that the communication topology satisfies connectivity, each state-action cost function estimate achieves asymptotic consistency in the mean-square sense, as well as expected asymptotic consistency, with respect to the combined (averaged) state-action cost functions of all agents that share an edge with the target agent. Because the agents' state-action cost functions satisfy mean-square consistency and expected consistency, the method of this embodiment guarantees that all agents in the system learn approximately consistent state-action cost functions, thereby ensuring coordination capability under the decentralized-training decentralized-execution framework.
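The exact expressions of these two guarantees are given in the patent as formulas not reproduced above; purely as an indication of the typical form such conditions take in distributed Q-learning (the symbols below, including the neighborhood average \bar{Q}_t, are assumptions), they can be written as:

\lim_{t \to \infty} \mathbb{E}\!\left[ \left\| Q_t^{i}(s,a) - \bar{Q}_t(s,a) \right\|^{2} \right] = 0 \quad \text{(mean-square consistency)}

\lim_{t \to \infty} \mathbb{E}\!\left[ Q_t^{i}(s,a) - \bar{Q}_t(s,a) \right] = 0 \quad \text{(expected consistency)}

where \bar{Q}_t(s,a) denotes the average of the state-action cost function estimates of the agents sharing an edge with the target agent.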
Meanwhile, the method of this embodiment achieves (p, r)-accuracy and ε-differential privacy protection performance.
In this embodiment, for each state-action cost function estimate, the error of the average of the state-action cost functions of all agents that share an edge with the target agent, relative to the optimal state-action cost function estimate, satisfies (p, r)-accuracy:
where p is the precision, r is the recall, and the variance of the random variable Δ(t) satisfies the constraint:
where M_t = (1 − α_{s,a}(t) + γ α_{s,a}(t))² ∈ (0, 1), and the remaining terms depend on the parameter settings of the differential privacy noise and on the degree matrix of the network topology.
In this embodiment, the degree of protection of the real communication information follows the definition of the differential privacy mechanism and can be measured by ε-differential privacy. Specifically, consider two data sets D and D′ satisfying δ-adjacency, i.e., they differ in only one data point and the difference is within δ. After this pair of data sets is input to a preset randomized algorithm M: D → O, the output of the algorithm satisfies the following probability relation:
P(M(D) ∈ O) ≤ exp(ε) · P(M(D′) ∈ O)
In this embodiment, the privacy level ε of the algorithm satisfies the corresponding bound, where Δη_i(t) is the difference between two random noise samples drawn independently from the same Laplace distribution Lap(0, ι_i(t)).
According to the method, the communication information of the networked multi-agent reinforcement learning system is effectively privacy-protected, and the privacy and the safety of the multi-agent reinforcement learning system are improved.
In this embodiment, taking the traffic simulation environment SUMO (Simulation of Urban MObility) as an example, a connected multi-vehicle cooperative autonomous driving scenario (Cooperative Adaptive Platoon Control, CAPC) is built on top of the distributed networked multi-agent learning system. As an alternative embodiment, the connected multi-vehicle autonomous driving scenario includes an acceleration-following scenario and a deceleration-following scenario for a single-lane vehicle platoon. Each vehicle corresponds to one agent, and during both the training and execution stages each agent can only receive the communication information of a limited set of neighbor agents within its communication range; for example, a given agent can only communicate with the agents adjacent to it (the vehicle ahead and the vehicle behind), and the area enclosed by the agent and its adjacent agents is its communication range. Next, the differential privacy parameters s_i and q_i of each agent are initialized; in this embodiment they are initialized to s_i = 0.01 and q_i = 0.99. Sampling is performed according to the differential privacy protection mechanism to obtain the additive noise η_i(t), the additive noise is added to the real communication information (the state-action cost function estimation information) to obtain the privacy-preserving communication information, and the privacy-preserving communication information is transmitted to the neighbors within the communication range.
The agent samples actions from its current policy network (neural network), i.e., executes local decision actions, interacts with the environment to obtain the new global environment state and the environment reward value, receives the privacy-preserving communication information sent by its neighbor agents, and updates the current policy network according to the agent policy update scheme described above. The initialization step and the environment-interaction step are repeated until the policy network converges, at which point the training process ends.
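A minimal sketch of this interaction loop, composed from the earlier sketches, is shown below; the environment and agent interfaces (env.reset, env.step, env.horizon, select_action, q_estimate, update) are illustrative assumptions rather than interfaces defined by this embodiment.

def train_episode(env, agents, adjacency, s_i=0.01, q_i=0.99, s_gain=0.1):
    state = env.reset()
    for t in range(env.horizon):
        # 1. Each agent samples a local decision action from its own estimates.
        actions = [ag.select_action(state) for ag in agents]
        # 2. Execute the joint actions, observe local rewards and the new global state.
        next_state, rewards = env.step(actions)
        # 3. Sample decaying Laplace noise and build privacy-preserving messages.
        noises = [sample_privacy_noise(t, s_i, q_i, s_gain) for _ in agents]
        masked, received = broadcast_private_estimates(
            [ag.q_estimate(state, a) for ag, a in zip(agents, actions)],
            noises, adjacency)
        # 4. Each agent updates its estimate from its reward and neighbors' messages.
        for i, ag in enumerate(agents):
            ag.update(state, actions[i], rewards[i], next_state, received[i])
        state = next_state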
To illustrate the effectiveness of this embodiment, two common cooperative autonomous-driving scenarios are constructed: a single-lane acceleration-following scenario (CAPC catch-up) and a single-lane deceleration-following scenario (CAPC slow-down), both built with the SUMO traffic simulator. For the parameter settings of the agents (vehicles), the Optimal Velocity Model (OVM) embedded in the simulation software is adopted as the vehicle model. The state set of a vehicle comprises its distance to the vehicle ahead h_i, its current speed v_i and its current acceleration a_i; the longitudinal control action set is formed by combinations of a headway gain and a speed gain, and four levels {(0, 0), (0, 0.5), (0.5, 0), (0.5, 0.5)} are considered here. The control period is 0.1 s, the total control duration is 60 s, and a reward (cost) function is set accordingly. The detailed CAPC scenario parameter settings are given in Table 0 below. Specifically, in the CAPC acceleration-following scenario, the driving speeds and headways of all vehicles are randomly initialized such that the speed of every vehicle except the lead vehicle is lower than the optimal driving speed and its headway is larger than the optimal headway; the ideal goal is that all following vehicles learn a cooperative strategy of speeding up and closing the gap. In the CAPC deceleration-following scenario, the driving speeds and headways of all vehicles are randomly initialized such that the speed of every vehicle except the lead vehicle is higher than the optimal driving speed and its headway is slightly smaller than the optimal headway; the ideal goal is that all following vehicles learn a cooperative strategy of slowing down while avoiding collisions, and the decision strategy in this scenario is more complex because collisions are possible.
Table 0: CAPC scenario parameter settings in the SUMO simulation software
Experimental scenario parameter | Parameter value
Safe distance to the vehicle ahead | h_i ≥ 1 m
Safe driving speed | v_i ≤ 30 m/s
Safe acceleration | |a_i| ≤ 2.5 m/s²
Stopping headway in OVM | h_stop = 5 m
Full-speed headway in OVM | h_full = 35 m
Crash penalty (headway less than 1 m) | 1000
Additional collision penalty cost | 5(2·h_stop − h_{i,t})²
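The reward (cost) function itself is given in the patent as a formula not reproduced above; purely as a hedged illustration consistent with Table 0, a per-vehicle cost of the following form could be used, where the quadratic tracking terms and the target values h_star and v_star are assumptions.

def vehicle_cost(h_i, v_i, h_star=20.0, v_star=15.0, h_stop=5.0):
    # Assumed tracking cost: quadratic deviations from target headway and speed.
    cost = (h_i - h_star) ** 2 + (v_i - v_star) ** 2
    if h_i < 1.0:  # crash: headway below the 1 m safety distance (Table 0)
        cost += 1000.0 + 5.0 * (2.0 * h_stop - h_i) ** 2  # penalties listed in Table 0
    return cost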
Three different differential privacy gain coefficients s_gain were set to control the degree of protection; each group of experiments was trained for 1,000,000 steps and the cumulative reward value (Rewards) was collected as the performance index. The results are shown in Table 1.
Table 1: Reward values obtained by the multi-agent learning algorithm under different degrees of noise protection in the CAPC scenarios
Fig. 3 is a schematic performance diagram of the multi-agent communication method according to an embodiment of the present invention in the single-lane deceleration-following scenario, and Fig. 4 is the corresponding performance diagram in the single-lane acceleration-following scenario; in Figs. 3 and 4, s_gain is set to 0, 0.1 and 0.01, respectively, and the total number of steps (Steps) is 1,000,000. The method of this embodiment can significantly improve the performance and security of the algorithm on multi-agent cooperation tasks.
According to the multi-agent communication method for networked multi-agent reinforcement learning based on differential privacy protection in this embodiment, a decentralized, distributed networked multi-agent learning system is constructed so that, during training and execution, each agent exchanges local information only with the neighbor agents within its communication range. This reduces the heavy training-resource requirements of a centralized system and the systemic risk of single-point failures. At the same time, Laplace additive noise is added to the communication information so that the real communication information cannot be stolen and restored by a malicious third-party node, yielding a multi-agent cooperative policy with better security, privacy and robustness. Under the DTDE learning framework, it can be guaranteed that the communication information in the communication channel cannot be recovered after being intercepted by a malicious third-party node, effectively improving the learning efficiency, reliability and security of multi-agent cooperative reinforcement-learning communication. The method has the following distinct advantages:
1) Compared with the mainstream CTDE algorithm learning framework, this embodiment, by constructing a DTDE algorithm learning framework, has lower systemic risk, better scalability and flexibility, higher learning efficiency and stronger cooperation capability among agents.
2) The method of this embodiment can significantly improve the privacy and security of the communication information of the multi-agent cooperation algorithm, avoiding the risk that communication information is intercepted by a malicious third-party node and the user's real information is restored; in terms of data security it surpasses current mainstream multi-agent cooperation algorithms that have no privacy protection function.
Based on the same inventive concept, as shown in fig. 5, an embodiment of the present invention further provides a multi-agent communication device, where the device includes:
A system construction module 501, configured to construct a distributed networked multi-agent learning system based on a communication relationship between agents, where the agents are execution unit nodes of the distributed networked multi-agent learning system, and the communication relationship is an edge of the distributed networked multi-agent learning system;
In this embodiment, as an alternative embodiment, a distributed networked multi-agent learning system with decentralized training and decentralized execution is constructed based on a networked multi-agent Markov decision process.
The state action module 502 is configured to generate a local decision action by using a Q-Learning algorithm based on a current global environment state observed by a target agent in a current decision period and state-action cost function estimation information represented by a neural network, execute the local decision action in the current global environment state, and obtain a reward value of the state-action cost function estimation information and a new global environment state from an environment;
In this embodiment, each agent performs an independent local decision action according to the current global environment state, receives its own local reward, and uses the state-action cost function estimation information of the "state-action" pair as the real communication information.
In this embodiment, as an optional embodiment, the state action module 502 is further configured to:
And in the next decision period, the target agent executes the step of acquiring the local decision action based on the new global environment state and the updated neural network.
The noise acquisition module 503 is configured to sample based on laplace function distribution according to a preset differential privacy parameter, so as to acquire random noise information;
in this embodiment, as an optional embodiment, the noise obtaining module 503 includes:
a calculating unit configured to calculate a product of the sampling coefficient, the first differential privacy parameter value, and the second differential privacy parameter value;
And the sampling unit is used for taking zero as a position parameter of the Laplace function distribution, taking the product as a scale parameter of the Laplace function distribution, and sampling to obtain the random noise information.
In this embodiment, as an alternative embodiment, the random noise information is obtained using the following equation:
- η_i(t) ~ Lap(0, ι_i(t))
A privacy protection module 504 configured to generate privacy-protected communication information based on the random noise information and the state-action cost function estimation information, and send the privacy-protected communication information to an agent having a side with the target agent;
In this embodiment, random noise information is added to real communication information to construct private communication information, so as to exchange private communication information with neighbor agents within a communication range.
In this embodiment, as an optional embodiment, the privacy protection module 504 includes:
a privacy information generating unit configured to add the random noise information to the state-action cost function estimation information, and generate the privacy-preserving communication information;
and the privacy information sending unit is used for sending the privacy protection communication information to the agent with the side with the target agent.
In this embodiment, as an alternative embodiment, the privacy-preserving communication information is generated using the following formula:
The policy updating module 505 is configured to receive privacy-preserving communication receiving information sent by an agent having an edge with the target agent, and update the neural network according to the decision action taken by the target agent, the state-action cost function estimation information, the privacy-preserving communication receiving information, the reward value, the new global environment state, and the current global environment state.
In this embodiment, as an alternative embodiment, the neural network is updated using the following formula:
Wherein the parameters satisfy the following conditions:
Based on the same inventive concept, the embodiments of the present invention also provide a storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the multi-agent communication method in any of the possible implementations described above.
Alternatively, the storage medium may be a non-transitory computer readable storage medium, for example, a ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
Based on the same inventive concept, and referring to Fig. 6, an embodiment of the present invention further provides an electronic device, including a memory 101 (e.g., a nonvolatile memory), a processor 102, and a computer program stored in the memory 101 and executable on the processor 102. When the processor 102 executes the program, the steps of the multi-agent communication method in any of the above possible implementations are implemented, which corresponds to the multi-agent communication device described above; of course, the processor may also be used to process other data or operations. The electronic device may be a PC, a server, a terminal, or the like.
As shown in fig. 6, the electronic device may generally further include: memory 103, network interface 104, and internal bus 105. In addition to these components, other hardware may be included, which is not described in detail.
It should be noted that the multi-agent communication device may be implemented by software, and is a device in a logic sense, and is formed by the processor 102 of the electronic device where the multi-agent communication device is located reading the computer program instructions stored in the nonvolatile memory into the memory 103 and running the computer program instructions.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general purpose and/or special purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit will receive instructions and data from a read only memory and/or a random access memory. The essential elements of a computer include a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks, etc. However, a computer does not have to have such a device. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disk or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. On the other hand, the various features described in the individual embodiments may also be implemented separately in the various embodiments or in any suitable subcombination. Furthermore, although features may be acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing description is merely exemplary of embodiments of the present invention and is provided to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A multi-agent communication method, comprising:
constructing a distributed networked multi-agent learning system based on communication relationships among agents, wherein the agents are the execution unit nodes of the distributed networked multi-agent learning system, and the communication relationships are the edges of the distributed networked multi-agent learning system;
generating a local decision action by using a Q-Learning algorithm based on the current global environment state observed by a target agent in the current decision period and on state-action cost function estimation information represented by a neural network, executing the local decision action in the current global environment state, and obtaining, from the environment, a reward value for the state-action cost function estimation information and a new global environment state;
sampling from a Laplace distribution according to preset differential privacy parameters to obtain random noise information;
generating privacy-preserving communication information based on the random noise information and the state-action cost function estimation information, and transmitting the privacy-preserving communication information to each agent sharing an edge with the target agent; and
receiving privacy-preserving communication information sent by the agents sharing an edge with the target agent, and updating the neural network according to the current global environment state, the decision action taken by the target agent, the state-action cost function estimation information, the received privacy-preserving communication information, the reward value, and the new global environment state.
2. The multi-agent communication method according to claim 1, wherein the sampling from the Laplace distribution according to the preset differential privacy parameters to obtain random noise information comprises:
calculating a product of the sampling coefficient, the first differential privacy parameter value, and the second differential privacy parameter value; and
sampling with zero as the location parameter of the Laplace distribution and the product as its scale parameter to obtain the random noise information.
3. The multi-agent communication method of claim 1, wherein the generating privacy-preserving communication information based on the random noise information and the state-action cost function estimation information comprises:
adding the random noise information to the state-action cost function estimation information to generate the privacy-preserving communication information.
4. A multi-agent communication method according to any one of claims 1 to 3, further comprising:
in the next decision period, the target agent performs the step of generating a local decision action based on the new global environment state and the updated neural network.
5. A multi-agent communication device, the multi-agent communication device comprising:
a system construction module, configured to construct a distributed networked multi-agent learning system based on communication relationships among agents, wherein the agents are the execution unit nodes of the distributed networked multi-agent learning system, and the communication relationships are the edges of the distributed networked multi-agent learning system;
a state action module, configured to generate a local decision action by using a Q-Learning algorithm based on the current global environment state observed by a target agent in the current decision period and on state-action cost function estimation information represented by a neural network, execute the local decision action in the current global environment state, and obtain, from the environment, a reward value for the state-action cost function estimation information and a new global environment state;
a noise acquisition module, configured to sample from a Laplace distribution according to preset differential privacy parameters to obtain random noise information;
a privacy protection module, configured to generate privacy-preserving communication information based on the random noise information and the state-action cost function estimation information and send the privacy-preserving communication information to each agent sharing an edge with the target agent; and
a policy updating module, configured to receive privacy-preserving communication information sent by the agents sharing an edge with the target agent, and update the neural network according to the decision action taken by the target agent, the state-action cost function estimation information, the received privacy-preserving communication information, the reward value, the new global environment state, and the current global environment state.
6. The multi-agent communication device of claim 5, wherein the noise acquisition module comprises:
a calculating unit configured to calculate a product of the sampling coefficient, the first differential privacy parameter value, and the second differential privacy parameter value;
and a sampling unit, configured to sample with zero as the location parameter of the Laplace distribution and the product as its scale parameter to obtain the random noise information.
7. The multi-agent communication device of claim 5, wherein the privacy protection module comprises:
a privacy information generating unit configured to add the random noise information to the state-action cost function estimation information, and generate the privacy-preserving communication information;
and a privacy information sending unit, configured to send the privacy-preserving communication information to each agent sharing an edge with the target agent.
8. The multi-agent communication device of any one of claims 5 to 7, wherein the state action module is further configured to:
in the next decision period, the target agent performs the step of generating a local decision action based on the new global environment state and the updated neural network.
9. A storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the multi-agent communication method of any one of claims 1 to 4.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the multi-agent communication method of any one of claims 1 to 4 when the program is executed.
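To connect the claim language to a concrete computation, the following minimal Python sketch shows how the random noise information of claims 2 and 3 could be generated and combined with the local state-action estimate: the Laplace location parameter is zero, and the scale is the product of a sampling coefficient and the two differential privacy parameter values. All function and parameter names (laplace_noise, privacy_preserving_message, sampling_coeff, dp_param_1, dp_param_2) are illustrative and not taken from the patent.

```python
import numpy as np

def laplace_noise(shape, sampling_coeff, dp_param_1, dp_param_2, rng=None):
    """Sample noise with location 0 and scale = sampling_coeff * dp_param_1 * dp_param_2."""
    rng = rng if rng is not None else np.random.default_rng()
    scale = sampling_coeff * dp_param_1 * dp_param_2
    return rng.laplace(loc=0.0, scale=scale, size=shape)

def privacy_preserving_message(q_estimate, sampling_coeff, dp_param_1, dp_param_2, rng=None):
    """Additively perturb the local state-action estimate before sharing it (claim 3)."""
    q_estimate = np.asarray(q_estimate, dtype=float)
    noise = laplace_noise(q_estimate.shape, sampling_coeff, dp_param_1, dp_param_2, rng)
    return q_estimate + noise

# Example: perturb a 5-action estimate with illustrative parameter values of 0.5 each.
message = privacy_preserving_message(np.zeros(5), sampling_coeff=1.0,
                                     dp_param_1=0.5, dp_param_2=0.5)
```

Because the noise is additive and zero-mean, averaging many received messages tends toward the unperturbed estimate, which is why this kind of perturbation can preserve learning utility while masking any single shared value.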
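The next sketch strings the claimed steps into one decision period for a single agent (claims 1 and 4): observe the global state, choose a local action epsilon-greedily under Q-Learning, execute it, send a Laplace-perturbed copy of the local estimate along every edge, and update the value estimate using the received messages together with the local transition. This is a minimal sketch, not the patented implementation: the linear Q-approximator stands in for the neural network, and the consensus-style averaging of the neighbours' noisy estimates in the update is an assumption, since the claims only state that the update uses the received privacy-preserving communication information.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_values(weights, state):
    # Linear stand-in for the neural network: one weight row per action.
    return weights @ state

def decision_period(weights, state, step_fn, neighbour_msgs,
                    epsilon=0.1, gamma=0.99, lr=0.05,
                    sampling_coeff=1.0, dp_param_1=0.5, dp_param_2=0.5):
    q = q_values(weights, state)

    # Epsilon-greedy local decision action under Q-Learning.
    action = int(rng.integers(len(q))) if rng.random() < epsilon else int(np.argmax(q))

    # Execute the action; the environment returns a reward and the new global state.
    reward, next_state = step_fn(action)

    # Privacy-preserving message sent along every edge: Laplace-perturbed estimate.
    scale = sampling_coeff * dp_param_1 * dp_param_2
    outgoing = q + rng.laplace(0.0, scale, size=q.shape)

    # Assumed consensus step: average the local estimate with the received
    # noisy estimates before the temporal-difference update.
    blended = np.mean([q] + list(neighbour_msgs), axis=0)
    td_target = reward + gamma * np.max(q_values(weights, next_state))
    weights[action] += lr * (td_target - blended[action]) * state

    # The new global state seeds the next decision period (claim 4).
    return weights, next_state, outgoing

# Toy usage: 3 actions, 4-dimensional global state, messages from two neighbours.
w = np.zeros((3, 4))
s = rng.normal(size=4)
step = lambda a: (1.0 if a == 0 else 0.0, rng.normal(size=4))
w, s, msg = decision_period(w, s, step, [rng.normal(size=3), rng.normal(size=3)])
```

In a full system, each agent would run this loop in parallel, with msg delivered to the agents that share an edge with it, so that only noise-perturbed estimates, never raw local values, ever leave an agent.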
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311586285.7A CN117579358B (en) | 2023-11-24 | 2023-11-24 | Multi-agent communication method, device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117579358A CN117579358A (en) | 2024-02-20 |
CN117579358B true CN117579358B (en) | 2024-09-06 |
Family
ID=89889694
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311586285.7A Active CN117579358B (en) | 2023-11-24 | 2023-11-24 | Multi-agent communication method, device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117579358B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112801731A (en) * | 2021-01-06 | 2021-05-14 | 广东工业大学 | Federal reinforcement learning method for order taking auxiliary decision |
CN113592101A (en) * | 2021-08-13 | 2021-11-02 | 大连大学 | Multi-agent cooperation model based on deep reinforcement learning |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11321614B2 (en) * | 2017-09-29 | 2022-05-03 | Oracle International Corporation | Directed trajectories through communication decision tree using iterative artificial intelligence |
US11907666B2 (en) * | 2020-11-16 | 2024-02-20 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems and methods for utility-preserving deep reinforcement learning-based text anonymization |
CN113688977B (en) * | 2021-08-30 | 2023-12-05 | 浙江大学 | Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium |
CN114662639A (en) * | 2022-03-24 | 2022-06-24 | 河海大学 | Multi-agent reinforcement learning method and system based on value decomposition |
CN115759284A (en) * | 2022-11-02 | 2023-03-07 | 超参数科技(深圳)有限公司 | Intelligent agent training method, computer equipment and storage medium |
CN115983598A (en) * | 2023-01-16 | 2023-04-18 | 浙江大学 | Micro-grid privacy protection and energy scheduling method based on distributed deep reinforcement learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113162679B (en) | DDPG algorithm-based IRS (intelligent resilient software) assisted unmanned aerial vehicle communication joint optimization method | |
Qiong et al. | Towards V2I age-aware fairness access: A DQN based intelligent vehicular node training and test method | |
Ferdowsi et al. | Neural combinatorial deep reinforcement learning for age-optimal joint trajectory and scheduling design in UAV-assisted networks | |
di Bernardo et al. | Design, analysis, and experimental validation of a distributed protocol for platooning in the presence of time-varying heterogeneous delays | |
Xiao et al. | Resource-efficient platooning control of connected automated vehicles over VANETs | |
Wang et al. | Control of vehicle platoons for highway safety and efficient utility: Consensus with communications and vehicle dynamics | |
CN113472419B (en) | Safe transmission method and system based on space-based reconfigurable intelligent surface | |
Pirani et al. | Cooperative vehicle speed fault diagnosis and correction | |
Lin et al. | Topology‐based distributed optimization for multi‐UAV cooperative wildfire monitoring | |
CN111339554B (en) | User data privacy protection method based on mobile edge calculation | |
Hayashi et al. | A study on the variety and size of input data for radio propagation prediction using a deep neural network | |
CN112243252A (en) | Safety transmission enhancement method for relay network of unmanned aerial vehicle | |
CN114124823B (en) | Self-adaptive routing method, system and equipment oriented to high dynamic network topology | |
Swenson et al. | Distributed inertial best-response dynamics | |
CN113382060B (en) | Unmanned aerial vehicle track optimization method and system in Internet of things data collection | |
CN116600316A (en) | Air-ground integrated Internet of things joint resource allocation method based on deep double Q networks and federal learning | |
CN113207124A (en) | Vehicle-mounted ad hoc network data packet transmission method and device | |
CN103916969A (en) | Combined authorized user perception and link state estimation method and device | |
CN117579358B (en) | Multi-agent communication method, device, storage medium and electronic equipment | |
CN113301562A (en) | Second-order multi-autonomous system differential privacy convergence method and system for quantitative communication | |
Ran et al. | A novel cooperative searching architecture for multi‐unmanned aerial vehicles under restricted communication | |
Balico et al. | On the performance of localization prediction methods for vehicular ad hoc networks | |
Liu et al. | Empowering autonomous systems with AI-enabled V2X communication based signal analysis using sliding window integrated ensemble machine learning model | |
CN115426635A (en) | Unmanned aerial vehicle communication network inference method and system under unreliable transmission scene | |
CN115603854A (en) | Unmanned aerial vehicle cooperative interference covert communication method facing cognitive radio network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||