1 Introduction

The rapid development of network communication technology has dramatically promoted social progress and provided people with more prosperous and convenient life experiences. Among these advances, wireless communication technology has injected new vitality into the proliferation of intelligent applications, significantly changing people’s lives. Wi-Fi has become the most widely used technology among wireless communication technologies due to its advantages of long communication distances, high transmission rates, and fast connectivity [1]. Wi-Fi (Wireless Fidelity) technology continues to improve and evolve, constantly introducing new standards and protocols. In 2024, the IEEE released the latest Wi-Fi 7 standard (IEEE 802.11be), which aims to increase transmission rates, enhance security, and improve performance. The introduction of the Wi-Fi 7 standard has enabled a broader application of Wi-Fi technology in scenarios such as intelligent manufacturing, smart cities, smart offices, innovative education, Virtual Reality (VR), Augmented Reality (AR), and the Internet of Things (IoT). However, as the number of wireless access devices proliferates and application scenarios diversify, Wi-Fi networks face serious performance degradation challenges. In high-density device access scenarios, connecting multiple devices can cause network congestion, reducing data transmission rates and network throughput. At the same time, the services carried by Wi-Fi protocols are becoming more diverse and complex, including latency-sensitive applications such as interactive gaming and AR/VR, which place higher demands on Quality of Service (QoS). As a result, there is an urgent need to ensure that Wi-Fi networks can provide reliable and efficient services in high-density device access scenarios.

The contention window (CW) is one of the key factors affecting network performance. The CW defines the range from which a device randomly selects its backoff counter after detecting that the channel is free. When multiple nodes attempt to access the wireless channel simultaneously, a collision occurs, all frames involved in the collision are lost, and the nodes must retry after a random period, which significantly increases their data transmission time. A larger CW reduces the chance of multiple devices choosing the same backoff counter, thereby reducing the probability of collisions caused by simultaneous transmission. However, a larger CW also means increased latency, because devices wait longer before transmitting. A smaller CW reduces device latency and improves network responsiveness, but it also increases the risk of multiple devices choosing the same backoff counter, resulting in more collisions and lower channel utilization.
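The trade-off can be made concrete with a small calculation. The following sketch is illustrative only: it assumes each station draws its backoff counter independently and uniformly from the integers 0 to CW, and it computes the probability that at least two stations draw the same counter. The function name and the example values (10 stations, four CW sizes) are assumptions chosen for illustration, not taken from the paper.

```python
def prob_shared_counter(n_stations: int, cw: int) -> float:
    """Probability that at least two of n_stations draw the same backoff
    counter, assuming independent uniform draws from {0, ..., cw}."""
    slots = cw + 1
    if n_stations > slots:
        return 1.0  # more stations than counter values: a clash is certain
    p_all_distinct = 1.0
    for i in range(n_stations):
        p_all_distinct *= (slots - i) / slots
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    # Illustrative values only: 10 contending stations, growing CW.
    for cw in (15, 63, 255, 1023):
        print(cw, round(prob_shared_counter(10, cw), 3))
```

The probability falls as CW grows, while the mean counter value (CW / 2 slots under the same assumption) rises, which is the latency cost described above.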

The IEEE 802.11 protocol primarily defines standards for the physical and MAC layers. Distributed Coordination Function (DCF) serves as the primary channel access mechanism at the MAC layer, using Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) to avoid collisions by monitoring channel activity. The Wi-Fi 7 standard uses an enhanced DCF mechanism that incorporates techniques such as Orthogonal Frequency Division Multiple Access (OFDMA) modulation and time-based scheduling to better adapt to high-density, high-throughput network environments. However, due to the inherent limitations of the DCF mechanism, specific problems remain. When a large number of devices access the network at the same time, the CSMA/CA mechanism can lead to increased collisions, thereby reducing channel transmission efficiency. In addition, the DCF mechanism lacks explicit classification and priority control for different types of traffic, which can affect transmission reliability and increase delays in scenarios that require high real-time performance. IEEE 802.11e is an amendment to the Wi-Fi standard designed to enhance the QoS support of IEEE 802.11 to meet the needs of real-time and multimedia applications. One of its key components is Enhanced Distributed Channel Access (EDCA), which classifies and prioritizes multimedia traffic using MAC Layer Resource Allocation (MAC-RA) parameters. However, under varying network densities, static MAC-RA parameter values may not always be optimal. Changes in network density can lead to channel congestion or idle channel time waste. Therefore, how to intelligently and appropriately adjust MAC-RA parameter values is a crucial issue within the EDCA mechanism.

As machine learning (ML) [2,3,4] continues to be applied in various domains, it offers new approaches to address these challenges. Reinforcement learning (RL) [5, 6] allows agents to learn optimal decision strategies by interacting with the environment, making it well suited to dynamic and complex wireless network environments. Deep reinforcement learning (DRL) [7,8,9], which combines the feature extraction capabilities of deep learning (DL) with the decision-making capabilities of RL, can be flexibly adapted to different tasks and scenarios, providing scalability and generality. In addition, DRL algorithms can explore and learn in unknown environments, gradually improving their decision-making efficiency. DRL techniques have been successfully applied in wireless networks, adjusting the CW based on real-time network conditions to optimize transmission efficiency and maximize network performance. With the proliferation of connected devices and the rapid growth of data traffic, traditional IEEE 802.11 standards often struggle to manage network resources and handle device contention effectively, resulting in problems such as network congestion, increased collision rates, and reduced throughput. Wydmański et al. [10] proposed a scheme called CCOD-DQN (Centralized Contention Window Optimization with DRL), which uses Deep Q-Networks (DQN) to explore the most appropriate CW values under different network conditions, thereby improving the throughput of Wi-Fi networks. To address the potential over-increasing or over-decreasing of CW values in the traditional binary exponential backoff (BEB) algorithm, Ke et al. [11] introduced a smart exponential-threshold-linear (SETL) backoff algorithm, which uses a threshold value to determine how the CW is adjusted after each data transmission. Building on the SETL backoff algorithm, Ke et al. [12] used DRL techniques to explore the optimal setting of the threshold, which further improved the performance of Wi-Fi networks. Although these approaches significantly improve Wi-Fi network performance through DRL, they do not fully consider the needs of different types of traffic. For latency-sensitive applications such as real-time video and online gaming, unacceptable delays directly impact the user experience and quality of service.

EDCA prioritizes real-time sensitive traffic (such as voice and video), but under high network load, contention across all access categories (ACs) intensifies, leading to increased transmission delays and packet loss for real-time traffic. Lower-priority traffic may experience frequent transmission failures or excessive retries, resulting in prolonged periods without successful transmission and severely impacting the user experience. The root of this issue lies in the CW range of the ACs within the EDCA mechanism and in the traditional backoff algorithms, which struggle to adapt effectively to varying network load conditions. The CW compensation value is a dynamic adjustment mechanism that adapts the CW range of each AC to the current network conditions. When the network is idle or under low load, the CW compensation value keeps the CW range of each AC small, allowing real-time traffic to be transmitted more promptly and thereby improving efficiency. Under heavy network load and high contention, the CW compensation value expands the CW range of each AC, increasing the backoff time of stations, reducing collisions, and ensuring stable data transmission and throughput. Consider, for example, a public Wi-Fi network at an airport, where a large number of users are typically connected simultaneously. During early morning or late-night hours, when only a few passengers are using Wi-Fi in the terminal, the intelligent AP can use the CW compensation value to reduce the CW range of the ACs. This allows each user’s device to access the channel more quickly, providing a low-latency browsing experience. During peak flight arrival times, when many passengers are using Wi-Fi simultaneously before boarding, the intelligent AP increases the CW range of the ACs through the CW compensation value, preventing too many devices from attempting to send data packets at the same time, thus reducing collisions and ensuring a better user experience for services such as voice and video. Thus, by dynamically adjusting the CW range of the ACs through the CW compensation value, the network can adapt to different load conditions. The core goal is to reduce wait times and increase throughput under low network load, while minimizing collisions and maintaining stable network performance under heavy load, ultimately optimizing the user’s Wi-Fi experience.

Therefore, this paper investigates the optimal MAC-RA parameters in the EDCA mechanism under different device access densities using DRL technology and proposes an adaptive CW back off scheme that differentiates network traffic. The main contributions are as follows:

(1) A PDCF-DRL scheme is proposed to alleviate the increased collision rate and decreased throughput caused by station contention in high-density station environments, and to differentiate services according to network traffic type so as to better meet user experience expectations.

(2) Deep reinforcement learning is used to sense the state of the wireless network and explore a CW compensation value that intelligently adjusts the CW range of each AC, thereby better adapting to different network load conditions.

(3) A new adaptive backoff strategy is proposed that discriminates network conditions using CW thresholds and adopts different backoff strategies to optimize the data transmission process.

The remainder of this paper is organized as follows: Sect. 2 reviews the background and related works. Section 3 discusses the application of DRL technology in IEEE 802.11. Section 4 presents the proposed PDCF-DRL scheme. Section 5 presents simulation experiments and analysis. Section 6 presents the conclusions of this paper.

2 Background review and related work

2.1 Background review

2.1.1 Distributed coordination function mechanism

The IEEE 802.11 protocol defines two types of medium access control mechanisms, one of which is the contention-based Distributed Coordination Function (DCF). Before a station sends data, it first listens to the channel. If the channel is idle for one Distributed Inter-Frame Space (DIFS), the station generates a backoff counter. During the backoff process, each station monitors the channel. If the channel remains idle for one time slot, the station's backoff counter decreases by one. However, if a transmission is ongoing in the channel, the station's backoff counter is frozen. The station then waits for a DIFS, and if the channel is still idle, it resumes the backoff process. DCF defines two data transmission modes: the basic mode and the Request to Send/Clear to Send (RTS/CTS) mode. To reduce the probability of collisions caused by node contention, the binary exponential backoff (BEB) algorithm is used. When a node enters the backoff process, it starts a backoff counter and randomly selects an integer within the range [0, CW] as its initial value, where CW is the dynamically changing contention window size, bounded by the Minimum Contention Window (\(CW_{min}\)) and the Maximum Contention Window (\(CW_{max}\)). The backoff counter decreases by one after each idle time slot. When the counter reaches 0, the node sends the data frame. If two or more counters reach 0 simultaneously and a collision occurs after data transmission, the CW value is updated as shown in Formula 1, and the backoff counter is reinitialized.

$$\begin{aligned} CW=\min (2\times CW, CW_{max}) \end{aligned}$$
(1)
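Formula 1 and the counter draw can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the values \(CW_{min}=16\) and \(CW_{max}=1024\) are assumptions picked only to make the doubling visible.

```python
import random


def draw_backoff_counter(cw: int, rng: random.Random) -> int:
    """Pick the initial backoff counter uniformly from [0, cw]."""
    return rng.randint(0, cw)


def beb_update(cw: int, cw_max: int) -> int:
    """Formula 1: double the CW after a collision, capped at cw_max."""
    return min(2 * cw, cw_max)


def cw_after_collisions(cw_min: int, cw_max: int, n_collisions: int) -> list:
    """CW value before the first attempt and after each consecutive collision."""
    cw = cw_min
    history = [cw]
    for _ in range(n_collisions):
        cw = beb_update(cw, cw_max)
        history.append(cw)
    return history


if __name__ == "__main__":
    # Assumed illustrative bounds, not taken from the paper.
    print(cw_after_collisions(cw_min=16, cw_max=1024, n_collisions=7))
    print(draw_backoff_counter(16, random.Random(0)))
```

With these assumed bounds the CW reaches its cap after six collisions and stays there on the seventh.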

2.1.2 Enhanced distributed channel access mechanism

With the development and application of networking technology, the demand for mobility and reliability in networks has increased. To support QoS for different traffic flows at the MAC sublayer, the IEEE 802.11e working group proposed the EDCA mechanism, which is based on the DCF mechanism of IEEE 802.11. EDCA aims to better meet the service requirements of multimedia traffic in wireless networks. EDCA classifies traffic into eight Traffic Categories (TCs), which are mapped into four access categories (ACs): Voice, Video, Best-effort, and Background. Each channel thus defines four different access categories, as shown in Table 1. To distinguish between the four ACs with different priorities, four parameters are defined: Arbitration Inter-Frame Space (AIFS), \(CW_{min}\), \(CW_{max}\), and Transmission Opportunity (TXOP), as shown in Table 2. Each AC controls its channel access process through its own parameter settings, thus differentiating between types of traffic.

Table 1 Access categories and traffic categories
Table 2 Access categories and related parameters

The basic access mechanism of EDCA is illustrated in Fig. 1. Before transmitting data, each AC queue at the station listens to the channel. If the channel is idle during the listening period and remains idle for the duration of AIFS[AC], the AC can directly begin the backoff process. If the channel is busy, the AC continues to listen until the channel remains idle for AIFS[AC], then independently starts the backoff process. The calculation of AIFS[AC] is shown in Formula 2, where ST is a slot time and AIFSN is a positive integer related to the AC, with different ACs having different values. A smaller AIFS value means that a station can access the channel more quickly. After waiting for an AIFS[AC], each AC queue randomly generates a backoff counter (Backoff[AC]) to start the backoff process. The calculation of the Backoff[AC] value is shown in Formula 3, where the current CW value \(CW_{cur}\)[AC] initially equals \(CW_{min}\)[AC]. If the sending station does not receive an ACK frame from the receiving station within a certain period, it assumes that a collision has occurred and retransmits the data. On each retransmission, the value of \(CW_{cur}\)[AC] doubles, lengthening the backoff. Once \(CW_{cur}\)[AC] reaches \(CW_{max}\)[AC], it does not increase further. If the channel becomes busy before the backoff counter reaches 0, the counter freezes, and counting pauses until the channel has been idle for AIFS, after which the counter continues decrementing. When the backoff counter of an AC queue reaches zero, the AC enters the transmission process and contends for a TXOP. Upon obtaining a TXOP, a station can transmit multiple frames continuously within the \(TXOP_{limit}\) time. After each successful frame transmission (i.e., receiving an ACK), the station can send the next frame after a short interframe space (SIFS) delay without contending for the channel again. This helps improve channel utilization and network throughput while reducing waiting latency. TXOP likewise varies across AC queues. When TXOP is set to 0, the station can send only one frame at a time.

$$\begin{aligned} AIFS[AC]= & AIFSN[AC]\times {ST}+SIFS \end{aligned}$$
(2)
$$\begin{aligned} Backoff[AC]= & uniform(0,CW_{cur}[AC]-1)\times {ST} \end{aligned}$$
(3)
Fig. 1 Enhanced distributed channel access mechanism
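Formulas 2 and 3 can be evaluated directly. The sketch below is illustrative only. The slot time (9 µs), SIFS (16 µs), and per-AC AIFSN values are assumptions based on commonly cited IEEE 802.11 defaults for OFDM PHYs; they are not taken from Table 2 of this paper, and the function and constant names are hypothetical.

```python
import random

# Assumed illustrative timing values in microseconds (not from Table 2).
SLOT_TIME_US = 9
SIFS_US = 16

# Assumed AIFSN per access category (commonly cited EDCA defaults).
AIFSN = {"voice": 2, "video": 2, "best_effort": 3, "background": 7}


def aifs(aifsn: int, slot_time_us: int, sifs_us: int) -> int:
    """Formula 2: AIFS[AC] = AIFSN[AC] * ST + SIFS."""
    return aifsn * slot_time_us + sifs_us


def backoff_time(cw_cur: int, slot_time_us: int, rng: random.Random) -> int:
    """Formula 3: Backoff[AC] = uniform(0, CWcur[AC] - 1) * ST."""
    return rng.randint(0, cw_cur - 1) * slot_time_us


if __name__ == "__main__":
    rng = random.Random(0)
    for ac, n in AIFSN.items():
        print(ac, aifs(n, SLOT_TIME_US, SIFS_US), "us AIFS")
    print("sample backoff:", backoff_time(16, SLOT_TIME_US, rng), "us")
```

Under these assumed values, the voice AC waits 34 µs and the background AC 79 µs before its backoff starts, which shows how a smaller AIFSN translates into earlier channel access.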

2.1.3 Motivation

With the rapid development of smart devices, the number of devices connected to Wireless Local Area Networks (WLANs) in homes is increasing. Smart TVs, smart speakers, and smart home appliances place higher demands on WLAN performance, leading to a trend toward high-density WLAN access in households. In public places, such as airports and offices, WLANs are also being densely deployed to provide affordable internet access. As WLAN devices are deployed more densely, service types become more complex, and service traffic grows, the number of stations (STAs) served and the traffic carried by a single access point (AP) also become very large. Consequently, the collision rate of the network rises sharply and the throughput decreases, which greatly affects the user’s daily experience.

2.2 Related work

In the fields of Wireless Local Area Networks (WLAN), Wireless Sensor Networks (WSN), and the Internet of Things (IoT), many experts and scholars have devoted themselves to improving the DCF mechanism to enhance network performance and transmission efficiency. Their research and contributions in this area have made significant progress.

In WLAN, Wi-Fi technology is used to connect devices within a limited geographical area, such as laptops, smartphones, and tablets. In 2016, IEEE published the IEEE 802.11ah standard for industrial IoT. One of the core MAC mechanisms to address collision issues in dense wireless networks is the Restricted Access Window (RAW). Cheng et al. [13] proposed the Channel-Aware Contention Window Adaptation (CA-CWA) algorithm, which dynamically adjusts the CW during each RAW period when the DCF mechanism is required to contend for the channel, thereby improving the real-time performance of the IEEE 802.11ah standard. In 2019, the IEEE 802.11ax standard introduced a number of features to improve spatial reuse. The efficiency of how base stations select appropriate modulation and coding schemes based on different transmission conditions is closely related to the efficiency of spatial reuse features. Krotov et al. [14] proposed a new statistics-based rate control algorithm that takes into account the effect of the current spatial reuse features when selecting modulation and coding schemes, which significantly improves the throughput and effectively reduces latency, further improving the transmission efficiency of the network. Although these traditional enhancements improve WLAN performance to some extent, they tend to result in long delays and high collision rates in high-density access scenarios and have difficulty adapting to dynamic network conditions.

The advent of ML has provided new approaches to solving these problems. Sandholm et al. [15] proposed a machine learning-based contention window control strategy to control Wi-Fi contention windows on access points (APs) in dense environments, significantly improving Wi-Fi network throughput. Although the current Wi-Fi 6 technology can handle large amounts of data, its CSMA/CA mechanism lacks scalability and struggles to maintain stable throughput in dense scenarios. Chen et al. [16] proposed a deep learning-based contention window control strategy that uses deep learning to search for the optimal configuration of CW under different network conditions, which significantly improves Wi-Fi 6 in terms of system throughput, average transmission delay, and packet retransmission rate, while better adapting to the access of numerous IoT devices. DRL combines the feature extraction capabilities of DL with the decision capabilities of RL, providing good scalability and generality. Wydmański et al. [10] proposed a deep reinforcement learning-based contention window control strategy that uses DQN (Deep Q-Network) to explore the most appropriate CW values under different network conditions, thereby improving the throughput of Wi-Fi networks. Based on this, Ke et al. [12] used the DQN algorithm to explore a CW threshold that distinguishes network load conditions and to learn the optimal settings for each condition. This differentiated CW adjustment strategy further improves Wi-Fi network throughput and reduces collision rates.

In summary, current WLAN research is primarily focused on dynamic adaptation of CW, optimization of transmission rates, and modification of other channel access-related parameters. While these improvements enhance network performance under general conditions, practical applications often require support for multiple network services simultaneously. In meeting these diverse service requirements, a homogeneous scheduling strategy may not be sufficient, potentially resulting in inadequate bandwidth allocation or excessive delays for certain services. In addition, it may be more difficult to ensure the QoS for some real-time services (such as video conferencing and online gaming).

Wireless sensor networks (WSNs) consist of sensor nodes distributed over a spatial area. These nodes typically have sensing, processing, and communication capabilities and are used to monitor various physical parameters in the environment. In critical areas such as healthcare or industrial control, ensuring the accuracy and real-time nature of the data is essential to making the right decisions, so the QoS requirements are more stringent. To improve the performance of WSNs, Li et al. [17] proposed a Global View-based Adaptive Contention Window (GV-ACW) MAC protocol. GV-ACW adopts an optimized contention window size in the area near the sink to meet the requirements of data forwarding, while in the area far from the sink the contention window is larger than nodes strictly require for data transmission, so as to achieve a compromise between energy consumption (i.e., alternative energy harvesting) and delay and thereby improve overall network performance. Collisions between sensor nodes cause message delays, data loss, and retransmissions, which consume more energy. To mitigate collisions during transmission, Ghimire [18] proposed a new Energy-efficient Collision Mitigation MAC (ECM-MAC) protocol to address the QoS and energy constraints of various applications. ECM-MAC uses different CW and node priority mechanisms to improve overall network performance. Nodes with the most critical data or the least remaining energy are prioritized for transmission, thereby minimizing the likelihood of collisions and maximizing network lifetime. In high-collision scenarios, it intelligently reduces data transmission traffic to minimize collisions.

With the development of IoT and industrial networks, new challenges and issues have emerged in terms of adaptability, real-time performance, reliability, and energy efficiency. Le [19] proposed an efficient Backoff Priority-based Medium Access Control (BoP-MAC) scheme that supports multi-priority data and uses compensation mechanisms and backoff priorities to correctly adjust CW sizes, ensuring timely and reliable transmission of high-priority data. Wireless Body Area Networks (WBANs) consist of an array of energy-efficient miniature sensors designed to monitor human health conditions, facilitating the early detection and treatment of life-threatening diseases. To meet the stringent requirements of WBANs for energy efficiency, reliability, and low latency, Ranjan et al. [20] proposed the Priority and Contention Control (PCC) algorithm, which aims to minimize collisions, delays, and energy consumption. This algorithm prioritizes sensors based on runtime metrics and predicts the CW size based on the current channel state, queue length, and collision rate. By dynamically predicting the CW, the algorithm minimizes channel access delay, collisions, and energy consumption, thereby improving overall network performance.

In WBANs, due to the limitations of CSMA/CA, multi-node channel contention leads to high data latency, which can have severe consequences for delay-sensitive applications. To improve the response speed to emergency and alarm events in WBANs, Hussain [21] proposed a scheme that considers the level of emergency or message priority by associating the CW with the level of the burst event. For high-priority services, channel detection is performed only once, while for low-priority services, it is performed twice. The algorithm’s parameter settings are analyzed using a Markov chain model to ensure fast response to events. ML has played an essential role in addressing the complexities and challenges of WSNs by providing intelligent, adaptive, and efficient solutions for networks. Kwon et al. [22] proposed a Reinforcement Learning-based Contention Window Adjustment (RL-CWA) method for WBANs, which uses Q-Learning to adaptively adjust the contention window based on received ACKs, thereby reducing the number of collisions. In addition, RL-CWA maintains multiple Q-tables to account for the varying minimum and maximum contention windows depending on the user priority (UP) of the traffic. This machine learning-based approach significantly improves WBAN performance, making it more suitable for complex sensor network environments and application requirements.

Although researchers have proposed various optimization algorithms and solutions to ensure QoS in WSNs, many studies focus on small-scale WSNs in specific application scenarios, such as WBANs, which typically contain fewer sensor nodes. In high-density node environments, the network topology is more complex and may face more interference, collisions, and other challenges. Ensuring the QoS of different services in high-density device access scenarios is a complex task. To overcome the problems of traditional DCF mechanisms, which lack service differentiation and perform poorly in high-load network scenarios, this paper proposes a novel PDCF-DRL approach. This approach introduces a CW compensation mechanism on top of the existing EDCA mechanism. By leveraging DRL technology, it dynamically perceives the current network conditions to explore appropriate CW compensation values, thereby updating the CW sizes corresponding to different access categories (ACs). In addition, a new adaptive CW adjustment algorithm is proposed to effectively reduce the probability of collisions, increase the overall network throughput, better meet the QoS requirements of different services, and improve the overall user experience.

3 The application of DRL in IEEE 802.11

3.1 Partially observable Markov decision process

A Markov Decision Process (MDP) is a mathematical framework used to describe decision problems in stochastic environments. An MDP consists of a state space, an action space, state transition probabilities, and a reward function, where state transitions and rewards exhibit the Markov property, meaning that the next state and reward depend only on the current state and the action taken. A Partially Observable Markov Decision Process (POMDP) is an extension of the MDP that accounts for incomplete observations during the decision process. In addition to the basic elements of an MDP, a POMDP includes an observation space and observation probabilities. In a POMDP, agents cannot directly observe the complete state of the environment but make decisions based on partially observed information. Due to the dynamic and complex nature of wireless networks, obtaining direct access to network states is challenging. Therefore, modeling the CW optimization problem as a POMDP reflects real-world situations more accurately. Agents need to consider uncertainty and partial observations when making decisions in order to formulate adaptive strategies with long-term rewards.

The agent is located at a wireless access point (AP) with a global view of the entire network. By applying POMDP, the agent (AP) can effectively handle incomplete observations, make appropriate decisions based on the perceived network conditions, and determine the optimal CW compensation value (\(CW_{comp}\)) to adjust the CW sizes for different ACs, thereby maximizing network performance. Once the appropriate \(CW_{comp}\) is determined, the AP broadcasts it to all stations via beacon frames, thereby achieving collaborative optimization of CW. The POMDP can be defined as a 7-tuple \((S, A, T, R, \Omega, O, \gamma)\), whose elements are defined as follows:

S represents the state space (\(s \in S\)), i.e., the exact states of all devices in the network. In a POMDP, the agent cannot determine its environment’s state but can obtain incomplete information about the state through observation, relying on the history of observations and actions to make decisions.

A represents the action space (\(a \in A\)), i.e., the actions that can be taken in the current state, each of which sets the size of \(CW_{comp}\) according to Formula 4. Since \(a \in [0, 7]\), \(CW_{comp}\) ranges from 0 to 896.

$$\begin{aligned} CW_{comp}=2^7 \times a \end{aligned}$$
(4)
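The mapping in Formula 4 can be tabulated with a few lines of Python. This sketch assumes the action is an integer index from 0 to 7 (consistent with a discrete-action agent such as DQN, but an assumption here), and the function name is hypothetical. How \(CW_{comp}\) is then combined with each AC's CW range is not specified in this section, so the sketch covers only the mapping itself.

```python
def cw_comp_from_action(action: int) -> int:
    """Formula 4: CW_comp = 2^7 * a, for an assumed integer action a in [0, 7]."""
    if not 0 <= action <= 7:
        raise ValueError("action must be in the range [0, 7]")
    return (2 ** 7) * action


if __name__ == "__main__":
    print([cw_comp_from_action(a) for a in range(8)])
```

Under this assumption the agent chooses among eight compensation values spaced 128 apart, from 0 up to 896.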

T represents the state transition probabilities, indicating the probability distribution of transitioning from the current state (s) to another state (\(s'\)) by taking action (a).

R represents the reward function, i.e., the reward obtained for taking an action in the current state. Since throughput directly reflects network performance, throughput is chosen as the reward. Because rewards in DRL should lie between 0 and 1, the throughput is normalized by dividing the current throughput of the network (computed from the frames successfully received) by the desired maximum throughput, as shown in Formula 5, where TPS is the normalized throughput, \(N_r\) is the number of frames correctly received within a transmission period \(T_{period}\), \(\text{bit}(N_r)\) is the number of bits in those frames, and V is the channel transmission rate.

$$\begin{aligned} TPS = \frac{\text {bit}(N_r)}{V \times T_{\text {period}}} \end{aligned}$$
(5)
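As a worked example of Formula 5, the sketch below computes the normalized throughput from a bit count, a channel rate, and a period length. The function name and the numbers used (500 frames of 1500 bytes, a 100 Mbit/s channel, a 0.1 s period) are assumptions chosen for illustration, and \(\text{bit}(N_r)\) is read here as the total number of bits in the correctly received frames.

```python
def normalized_throughput(received_bits: int, rate_bps: float, period_s: float) -> float:
    """Formula 5: TPS = bit(N_r) / (V * T_period)."""
    if rate_bps <= 0 or period_s <= 0:
        raise ValueError("rate and period must be positive")
    return received_bits / (rate_bps * period_s)


if __name__ == "__main__":
    # Assumed illustrative numbers, not taken from the paper.
    n_r = 500                 # correctly received frames in the period
    frame_bits = 1500 * 8     # assumed fixed frame size in bits
    reward = normalized_throughput(n_r * frame_bits, rate_bps=100e6, period_s=0.1)
    print(round(reward, 3))
```

Because the received bits cannot exceed what the channel can carry in the period, the result stays within the 0 to 1 range required for the reward.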

\(\Omega\) represents the observation space, and O represents the set of conditional observation probabilities. When the agent takes action \(a \in A\), it receives observation \(o \in \Omega\), which depends on the new state of the environment with probability \(O(o \mid s', a)\). In this scheme, we define the historical collision rate H(\(P_{col}\)) as the observation o, where \(P_{col}\) represents the probability of unsuccessful transmission, i.e., the collision rate, defined in Formula 6, where \(N_t\) represents the total number of frames transmitted by stations and \(N_r\) represents the number of correctly received frames. In an IEEE 802.11 network, multiple stations connect to an AP and transmit data, and the AP can monitor all frames sent over the channel, including data frames, management frames, and control frames. When a station sends a frame, the AP detects its arrival and, regardless of whether the frame is successfully received, updates its internal frame counter, allowing it to track \(N_t\). If the AP successfully receives a frame, it sends an acknowledgment (ACK) frame to the sender. By recording the number of ACK frames sent, the AP can determine \(N_r\) and hence calculate \(P_{col}\). The historical collision rate H(\(P_{col}\)) is a tuple that stores the collision rate of the previous state and that of the current state. By comparing the two entries of H(\(P_{col}\)), we can determine whether the collision rate is increasing or decreasing at any given time. If the collision rate is increasing, the parameters can be adjusted and optimized again.

$$\begin{aligned} P_{col} = \frac{N_t - N_r}{N_t} \end{aligned}$$
(6)
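The observation described above can be sketched as a two-element history of Formula 6 values. The class and method names below are hypothetical, and two details are assumptions made only so the example runs: a period with no transmitted frames is treated as having a collision rate of 0, and the history starts at (0.0, 0.0).

```python
from collections import deque


def collision_rate(n_t: int, n_r: int) -> float:
    """Formula 6: P_col = (N_t - N_r) / N_t."""
    if n_t == 0:
        return 0.0  # assumption: no transmissions means no collisions
    return (n_t - n_r) / n_t


class CollisionHistory:
    """Keeps (previous P_col, current P_col), i.e. the observation H(P_col)."""

    def __init__(self) -> None:
        self._rates = deque([0.0, 0.0], maxlen=2)  # assumed initial history

    def update(self, n_t: int, n_r: int) -> tuple:
        """Record the latest period's counters and return the new history tuple."""
        self._rates.append(collision_rate(n_t, n_r))
        return tuple(self._rates)

    def is_increasing(self) -> bool:
        previous, current = self._rates
        return current > previous


if __name__ == "__main__":
    history = CollisionHistory()
    print(history.update(n_t=100, n_r=90))
    print(history.update(n_t=100, n_r=70))
    print(history.is_increasing())
```

Comparing the two stored values is what lets the agent tell a rising collision rate from a falling one.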

\(\gamma \in [0,1]\) represents the discount factor used to measure the agent’s consideration of future rewards. When \(\gamma\) is close to 1, the agent considers future rewards important; conversely, when \(\gamma\) is close to 0, the agent focuses more on current rewards.
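The effect of \(\gamma\) can be seen by computing the discounted return \(\sum_k \gamma^k r_k\) for a short reward sequence. The sketch below is a generic illustration of discounting, with a hypothetical function name and made-up reward values; it is not specific to the proposed scheme.

```python
def discounted_return(rewards: list, gamma: float) -> float:
    """Sum of gamma^k * r_k over the sequence, accumulated from the last reward backward."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]  # made-up rewards for illustration
    for gamma in (0.0, 0.5, 1.0):
        print(gamma, discounted_return(rewards, gamma))
```

With \(\gamma = 0\) only the immediate reward counts, while with \(\gamma = 1\) every future reward counts fully.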

3.2 Deep reinforcement learning

In WLAN networks, the contention window is a critical parameter that determines how long devices must wait before sending data, directly impacting network performance. Traditional contention window optimization methods typically rely on predefined rules and static parameters to adjust the contention window range. While these methods improve network performance to a certain extent, they have distinct limitations. First, traditional methods lack adaptability, making it difficult to flexibly adjust the contention window range based on real-time network conditions and traffic demands. As network conditions change continuously, statically tuned parameters can become inefficient or remain over-fitted to conditions that no longer hold. Second, traditional methods struggle to meet diverse requirements, as different application scenarios and quality of service demands may require different contention window strategies. Reinforcement Learning (RL) offers a potential solution to these problems by allowing the agent to learn the optimal decision strategy through interaction with the environment. Deep Q-Network (DQN) is a representative deep reinforcement learning algorithm that introduces experience replay and target networks to address stability and convergence issues, bringing the training of Q-learning closer to a supervised learning setting. The interaction process between the agent and the environment in DQN is shown in Fig. 2. The agent first evaluates the current state of the environment and uses it as input to a deep neural network (DNN). Based on this evaluation, the agent uses the DNN to determine a policy. The agent then executes actions based on the policy output from the DNN. The environment provides feedback on the agent’s actions by evaluating their effectiveness through reward signals. The agent updates its policy based on the received reward signals to better adapt to the environment in future interactions.
This process repeats iteratively, with the agent gradually improving its policy through trial and error, aiming to maximize cumulative rewards in the evolving environment. Therefore, applying deep reinforcement learning to Wi-Fi networks enables the agent to learn to adjust the contention window ranges dynamically for different ACs based on real-time network conditions, ultimately improving network performance.

Fig. 2 Interaction process of deep Q-network (DQN)

4 Proposal of a PDCF-DRL scheme

4.1 PDCF-DRL scheme

The EDCA mechanism distinguishes between different types of services. High-priority traffic flows are assigned shorter AIFS times and smaller CW sizes, allowing them to access the channel more quickly and thus improving the real-time performance and reliability of their data transmission. However, the channel access parameters in EDCA are static, with each AC having fixed CW sizes (\(CW_{min}\)[AC] and \(CW_{max}\)[AC]). These parameters do not take into account current network conditions or the level of network congestion. As the number of wireless access devices proliferates and traffic increases, high-priority traffic flows face increased collision probabilities due to their smaller CW sizes. In addition, the EDCA mechanism resets each station’s CW value to \(CW_{min}\)[AC] after each successful transmission, further increasing the likelihood of collisions. To improve the performance of wireless networks while ensuring the QoS of various network services, this paper proposes a novel DQN-based contention window adjustment scheme called the PDCF-DRL scheme. This scheme dynamically adjusts the \(CW_{min}\)[AC] and \(CW_{max}\)[AC] values in EDCA and introduces a new adaptive backoff strategy. It effectively distinguishes between different types of service and ensures their QoS, thereby reducing collision rates and maximizing throughput.

The PDCF-DRL scheme introduces a parameter dynamic adjustment mechanism that dynamically adjusts the CW size of each AC to adapt to different levels of network congestion by using DRL technology to monitor the network status and load conditions in real-time. The adjustment of the CW range of each AC is realized by the contention window compensation value (\(CW_{comp}\)) proposed in this paper, and the adjustment rule is shown in Table 3. The \(CW_{comp}\) value is explored by the agent in the DRL framework during each transmission period by sensing the network status. The range of the \(CW_{comp}\) value is [0, 894]. When the number of competing stations is small and the network load is light, the agent will choose a smaller \(CW_{comp}\) value to reduce the transmission delay and improve the transmission efficiency. Conversely, when the number of competing stations is high and the network load is heavy, the agent will select a larger \(CW_{comp}\) value to reduce collision rates and maximize throughput. Therefore, the PDCF-DRL scheme can adaptively adjust the CW range of each AC based on real-time network conditions, ensuring the QoS of high-priority traffic flows and achieving better network performance.

Table 3 Contention window ranges for ACs in the PDCF-DRL scheme

The PDCF-DRL scheme optimizes the backoff process of EDCA by employing different backoff strategies based on the congestion state of the channel. It introduces a CW threshold (\(CW_{thr}\)[AC]) to distinguish between different network conditions. Following the adaptive fair EDCF algorithm described in [23], the \(CW_{thr}\) for each AC queue is calculated as shown in Formula 7, where \(CW_{thr}\)[AC] represents the contention backoff threshold for the AC queue, \(CW_{min}\)[AC] and \(CW_{max}\)[AC] denote the minimum and maximum CW values for the AC queue, \(CW_{cur}\)[AC] is the current CW value of the AC queue, and Backoff[AC] is a random value uniformly distributed in the range (0, \(CW_{cur}\)[AC]) that is stored in the backoff counter and calculated as shown in Formula 2. In the EDCA mechanism, whenever an AC queue fails to transmit due to a collision, its \(CW_{cur}\)[AC] increases. Therefore, Formula 7 estimates the current network congestion level from the difference between \(CW_{max}\)[AC] and \(CW_{cur}\)[AC]. The denominator (\(CW_{max}\)[AC] - \(CW_{min}\)[AC]) normalizes this difference to ensure that \(CW_{thr}\)[AC] remains within a reasonable range. In addition, to account for the randomness of the backoff process, the formula includes the factor Backoff[AC]/\(CW_{cur}\)[AC], which is the ratio of the random value in the backoff counter to the current CW value. \(CW_{min}\)[AC] is used as a baseline for calculating \(CW_{thr}\)[AC] for different AC queues to ensure the lower limit of \(CW_{thr}\)[AC].

$$\begin{aligned} \text {CW}_{\text {thr}}[\text {AC}] = \frac{(\text {CW}_{\text {max}}[\text {AC}] - \text {CW}_{\text {cur}}[\text {AC}])}{(\text {CW}_{\text {max}}[\text {AC}] - \text {CW}_{\text {min}}[\text {AC}])} \times \frac{\text {Backoff}[\text {AC}]}{\text {CW}_{\text {cur}}[\text {AC}]} \times \text {CW}_{\text {min}}[\text {AC}] \end{aligned}$$
(7)
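
Formula 7 can be sketched as a small helper function. This is a minimal illustration rather than the authors' implementation: the function name, the argument order, and the input checks are assumptions, and the numeric inputs in the example are hypothetical.

```python
def cw_threshold(cw_min, cw_max, cw_cur, backoff):
    """CW threshold of one AC queue, following the structure of Formula 7."""
    if cw_max <= cw_min:
        raise ValueError("cw_max must be larger than cw_min")
    if cw_cur <= 0:
        raise ValueError("cw_cur must be positive")
    # Remaining headroom of the window, normalized to the CW range.
    headroom = (cw_max - cw_cur) / (cw_max - cw_min)
    # Ratio of the drawn backoff counter to the current window.
    backoff_ratio = backoff / cw_cur
    return headroom * backoff_ratio * cw_min


# Hypothetical queue state: window of 63 slots, drawn backoff of 21 slots.
thr = cw_threshold(cw_min=15, cw_max=1023, cw_cur=63, backoff=21)
```

For these inputs the three factors are 960/1008, 21/63, and 15, so the threshold evaluates to roughly 4.76.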

The PDCF-DRL scheme proposes a new adaptive backoff strategy that distinguishes the current network conditions based on the CW threshold and executes corresponding backoff strategies to avoid collisions and reduce waiting delays. It differentiates the network conditions into two scenarios by comparing the current CW value of the AC queue (\(CW_{cur}\)[AC]) with the CW threshold of the AC queue (\(CW_{thr}\)[AC]):

  1. (1)

    If \(CW_{cur}[AC] \le CW_{thr}[AC]\), this indicates good network conditions with low channel congestion. In this case, the backoff strategy shown in Formula 8 is used. If the data transmission in the AC queue is successful, its CW value (\(CW_{cur}\)[AC]) is set to the minimum CW value (\(CW_{min}\)[AC]) to reduce the CW quickly, thereby reducing the waiting time and improving the transmission efficiency. If data transmission in the AC queue fails, its CW value (\(CW_{cur}\)[AC]) is set to the previous CW value plus 32 to slowly expand the CW to avoid further collisions and reduce transmission delays.

    $$\begin{aligned} \text {CW}_{\text {cur}}[\text {AC}] = {\left\{ \begin{array}{ll} \text {CW}_{\text {min}}[\text {AC}] & \text {if success} \\ \text {CW}_{\text {cur}}[\text {AC}] + 32 & \text {if collision} \end{array}\right. } \end{aligned}$$
    (8)
  2. (2)

    If \(CW_{cur}\)[AC] > \(CW_{thr}\)[AC], this indicates poor network conditions with high channel congestion. In this case, the backoff strategy shown in Formula 9 is used. If the data transmission in the AC queue is successful, its CW value (\(CW_{cur}\)[AC]) is set to half of the previous CW value, instead of the minimum CW value (\(CW_{min}\)[AC]) as in the previous EDCA mechanism, to prevent too small a CW from increasing the probability of collisions. If the data transmission in the AC queue fails, its CW value (\(CW_{cur}\)[AC]) is set to twice the previous CW value, which rapidly increases the CW value exponentially to reduce the likelihood of collisions.

    $$\begin{aligned} \text {CW}_{\text {cur}}[\text {AC}] = {\left\{ \begin{array}{ll} \frac{\text {CW}_{\text {cur}}[\text {AC}]}{2} & \text {if success} \\ 2 \times \text {CW}_{\text {cur}}[\text {AC}] & \text {if collision} \end{array}\right. } \end{aligned}$$
    (9)

    After the CW value of the AC is adjusted, it must be validated to ensure that it does not exceed the value specified by IEEE 802.11. If the adjusted CW value is less than the minimum CW value of the AC (\(CW_{min}\)[AC]), it should be reset to \(CW_{min}\)[AC]; if it is greater than the maximum CW value of the AC (\(CW_{max}\)[AC]), it should be reset to \(CW_{max}\)[AC], as shown in Formula 10.

    $$\begin{aligned} \text {CW}_{\text {cur}}[\text {AC}] = {\left\{ \begin{array}{ll} \text {CW}_{\text {min}}[\text {AC}] & \text {if } \text {CW}_{\text {cur}}[\text {AC}] < \text {CW}_{\text {min}}[\text {AC}] \\ \text {CW}_{\text {max}}[\text {AC}] & \text {if } \text {CW}_{\text {cur}}[\text {AC}]> \text {CW}_{\text {max}}[\text {AC}] \end{array}\right. } \end{aligned}$$
    (10)
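
The two backoff strategies and the final range check (Formulas 8 to 10) can be combined into one update step, sketched below. The function and parameter names are illustrative, and the use of integer (floor) division when halving is an assumption, since the formulas do not specify how fractional windows are rounded.

```python
def update_cw(cw_cur, cw_thr, cw_min, cw_max, success, step=32):
    """Return the next CW value of an AC queue after one transmission attempt."""
    congested = cw_cur > cw_thr
    if success:
        # Formula 9 (halve) under congestion, Formula 8 (reset) otherwise.
        new_cw = cw_cur // 2 if congested else cw_min  # floor division is an assumption
    else:
        # Formula 9 (double) under congestion, Formula 8 (add 32) otherwise.
        new_cw = cw_cur * 2 if congested else cw_cur + step
    # Formula 10: keep the result inside [cw_min, cw_max].
    return max(cw_min, min(cw_max, new_cw))


# Hypothetical AC queue with CW range (15, 1023).
after_success = update_cw(63, 4.76, 15, 1023, success=True)
after_collision = update_cw(63, 4.76, 15, 1023, success=False)
```

Keeping the clamp as the last step means that every branch is validated in the same way, which mirrors the order described in the text.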

4.2 Implementation of the PDCF-DRL scheme

The proposed PDCF-DRL scheme is a deep reinforcement learning-based EDCA mechanism that differentiates network services to ensure QoS for different types of traffic while improving network throughput and reducing collision rates under heavy load conditions. The scheme consists of two key processes: the channel access process and the reinforcement learning process. In the channel access process, the PDCF-DRL scheme distinguishes the traffic types and assigns appropriate access parameters to different ACs to ensure the timely and reliable delivery of high-QoS traffic. The scheme employs the adaptive backoff strategy proposed in this paper, which uses a CW threshold to differentiate the level of channel congestion and selects the appropriate backoff strategy to optimize the data transmission process. In the reinforcement learning process, the PDCF-DRL scheme uses deep reinforcement learning to sense network conditions. It adjusts the CW compensation value based on the network load conditions and updates the contention window ranges for each AC queue accordingly. This allows the scheme to adapt to varying network conditions, maximizing network throughput and reducing collision rates.

4.2.1 Implementation of the channel access process

As wireless services diversify and traffic volumes continue to grow, distinguishing between different types of network traffic and providing differentiated QoS is a critical goal for next-generation WLANs. In this scheme, delay-sensitive traffic (video and voice) requires low latency to ensure the timely delivery of critical data to applications. Delay-sensitive data packets are given higher processing priority. Non-delay-sensitive traffic (background and best-effort) can tolerate higher delays without significantly impacting the user experience. Non-delay-sensitive data packets are given lower processing priority.

When multiple stations with different service types access the channel for data forwarding, different ACs and corresponding access parameters are assigned based on the service type of each station. The AP acts as an agent in the reinforcement learning process, periodically broadcasting the explored CW compensation value (\(CW_{comp}\)) to all stations via beacon frames. This adjustment of the CW size for each AC queue helps to better adapt to the current network conditions. The algorithm flow for the Station Channel Access is shown in Algorithm 1. The input is the \(CW_{comp}\) explored during the reinforcement learning process, and the output is the historical collision rate (H(\(P_{col}\))) and throughput (TP) obtained by applying the \(CW_{comp}\) in a transmission period. In Algorithm 1, lines 4 and 5 detail the listening process, where each AC (Access Category) queue monitors the channel and waits for the designated AIFS period. Lines 6 through 12 outline the backoff process. During this stage, each AC queue calculates a backoff value, monitors the channel for idleness, and decrements its backoff value accordingly. When an AC queue’s backoff value reaches zero, it enters the transmission phase, transmitting its data. After the transmission is complete, the AC queue adjusts its contention window and generates a new backoff value for subsequent transmissions. Upon completion of data transmissions by all queues, the overall collision rate and throughput are evaluated and recorded.

In the channel access process, stations update the CW size for each AC queue after receiving the \(CW_{comp}\) broadcast from the AP. When stations with different service types begin to compete for the channel to transmit data, they first check to see if the channel is idle. If the channel is idle for the AIFS[AC] period, the station with the AC queue starts the backoff process. After the backoff process is completed, it proceeds to the transmission process. Using its corresponding CW size (\(CW_{min}\)[AC], \(CW_{max}\)[AC]), the current CW value (\(CW_{cur}\)[AC]), and the random value of the backoff counter (Backoff[AC]), the station calculates the CW threshold (\(CW_{thr}\)) for the AC queue using Formula (7). The station then distinguishes different network conditions based on the obtained \(CW_{thr}\) and adopts the appropriate adjustment strategy. The two specific transmission stages are described below:

  1. (1)

    Successful Transmission Stage: If a station successfully accesses the channel and transmits data, the station will compare the time it has occupied the channel to the \(TXOP_{limit}\)[AC]. If the time has not reached the \(TXOP_{limit}\)[AC], the station can immediately send another frame after a SIFS without needing to compete for the channel again. However, if the time has reached the \(TXOP_{limit}\)[AC] or if the \(TXOP_{limit}\)[AC] is set to 0, the station stops occupying the channel and adopts an appropriate strategy to modify its CW based on the comparison between \(CW_{cur}\)[AC] and \(CW_{thr}\)[AC], as shown in Formula 11. If \(CW_{cur}[AC] \le CW_{thr}\)[AC], it indicates that there are fewer competing stations in the network and the channel congestion is low. In this case, the backoff strategy of the traditional EDCA mechanism is used, and \(CW_{cur}\)[AC] is set to \(CW_{min}\)[AC] to reduce the station’s waiting delay. If \(CW_{cur}[AC]> CW_{thr}\)[AC], it indicates that there are more competing stations in the network, and channel congestion is severe. Therefore, \(CW_{cur}\)[AC] is halved to mitigate potential collisions.

    $$\begin{aligned} \text {CW}_{\text {cur}}[\text {AC}] = {\left\{ \begin{array}{ll} \frac{\text {CW}_{\text {cur}}[\text {AC}]}{2} & \text {if } \text {CW}_{\text {cur}}[\text {AC}]> \text {CW}_{\text {thr}}[\text {AC}] \\ \text {CW}_{\text {min}}[\text {AC}] & \text {if } \text {CW}_{\text {cur}}[\text {AC}] \le \text {CW}_{\text {thr}}[\text {AC}] \end{array}\right. } \end{aligned}$$
    (11)
  2. (2)

    Collision Stage: If a station fails to transmit data, it compares \(CW_{cur}\)[AC] with \(CW_{thr}\)[AC] to adopt an appropriate strategy to modify the station’s CW, as shown in Formula 12. If \(CW_{cur}[AC] \le CW_{thr}\)[AC], it indicates that there are fewer competing stations in the current network and channel congestion is low. In this case, the CW value is slowly increased to avoid further collisions and reduce transmission delays. If \(CW_{cur}[AC]> CW_{thr}\)[AC], it indicates that there are more competing stations in the current network and channel congestion is severe. \(CW_{cur}\)[AC] is then doubled to rapidly increase CW and reduce the collision probability.

    $$\begin{aligned} \text {CW}_{\text {cur}}[\text {AC}] = {\left\{ \begin{array}{ll} \text {CW}_{\text {cur}}[\text {AC}] \times 2 & \text {if } \text {CW}_{\text {cur}}[\text {AC}]> \text {CW}_{\text {thr}}[\text {AC}] \\ \text {CW}_{\text {cur}}[\text {AC}] + 32 & \text {if } \text {CW}_{\text {cur}}[\text {AC}] \le \text {CW}_{\text {thr}}[\text {AC}] \end{array}\right. } \end{aligned}$$
    (12)

    At the end of a transmission period, the collision rate is calculated by using the number of successfully transmitted frames (\(N_r\)) and the total number of frames (\(N_t\)). The historical collision rate (H(\(P_{col}\))) is then obtained, where the last(\(P_{col}\)) represents the collision rate from the previous transmission period. The standardized throughput (TP) is obtained by dividing the number of successfully transmitted bits by the maximum expected received bits. The historical collision rate (H(\(P_{col}\))) and the standardized throughput (TP) are used as the state (s) and the reward (r), respectively, and are input into the reinforcement learning process to adjust and optimize the learning strategy.
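
The end-of-period bookkeeping described above can be sketched as follows. The class and argument names are hypothetical, and starting the history at 0.0 is an assumption that matches the zero-vector state initialization mentioned for the reinforcement learning process.

```python
from dataclasses import dataclass


@dataclass
class PeriodTracker:
    """Keeps the previous collision rate so that H(P_col) can be formed."""
    last_p_col: float = 0.0  # assumed initial value before the first period

    def close_period(self, n_t, n_r, bits_received, max_expected_bits):
        # Collision rate of the period (Formula 6), guarded against empty periods.
        p_col = (n_t - n_r) / n_t if n_t > 0 else 0.0
        # State: (previous collision rate, current collision rate).
        state = (self.last_p_col, p_col)
        # Reward: throughput normalized by the maximum expected received bits.
        reward = bits_received / max_expected_bits if max_expected_bits > 0 else 0.0
        self.last_p_col = p_col
        return state, reward


# Hypothetical counters for two consecutive transmission periods.
tracker = PeriodTracker()
state_1, reward_1 = tracker.close_period(1000, 850, 6_000_000, 8_000_000)
state_2, reward_2 = tracker.close_period(1000, 900, 7_000_000, 8_000_000)
```

In this example the second state pairs the earlier collision rate (0.15) with the new one (0.10), so the agent can see that collisions are decreasing.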

Algorithm 1 The Station Channel Access in the PDCF-DRL Scheme

4.2.2 Implementation of the reinforcement learning process

In the reinforcement learning process, the primary goal is to predict and adjust the CW compensation value (\(CW_{comp}\)) using DRL techniques. By applying a POMDP, the agent (AP) can effectively handle situations with incomplete observations and make appropriate decisions based on the perceived network conditions to determine the optimal \(CW_{comp}\). Once the \(CW_{comp}\) is determined, the agent (AP) broadcasts it to all connected stations and evaluates the network performance achieved with that \(CW_{comp}\) during the transmission period. It calculates the reward r (normalized throughput) and the observation o (historical collision rate H(\(P_{col}\))). Through continuous state observation, action selection, and reward feedback, the DQN can gradually learn an optimal \(CW_{comp}\) adjustment strategy to maximize the reward function, thereby improving overall network performance. The model diagram of the proposed PDCF-DRL scheme is shown in Fig. 3.

Fig. 3 The model diagram of the proposed PDCF-DRL scheme

The proposed PDCF-DRL scheme employs a feedforward Q-function network structure and uses a Deep Q-Network (DQN) to model the Q-function. The action space is represented as a vector, so the network predicts one Q-value per action, and the update is driven by the maximum predicted Q-value. The Adam optimization algorithm is used to adjust the model parameters and minimize the difference between the predicted values and the target values. An \(\epsilon\)-greedy strategy is implemented in the model for exploration. In the early stages of training, a larger \(\epsilon\) value is used to facilitate more exploration, and as training progresses, the degree of exploration gradually decreases, allowing for more exploitation, i.e., selecting the action with the largest Q-value in the current state. Training samples are drawn from the experience replay pool, which stores the agent’s past transitions, improving the efficiency and stability of the training. The specific algorithm flow is shown in Algorithm 2. The historical collision rate (H(\(P_{col}\))) and the throughput (TP) of each transmission period are used as inputs, and the output is the \(CW_{comp}\) explored by the DQN model. First, the experience pool D and some key parameters are initialized, \(CW_{comp}\) is initialized to 0, and the state is initialized to a zero vector. The weights of the evaluation network and the target network should be consistent, i.e., \(\theta ' = \theta\). After the initialization is complete, training begins, and each time step is executed. For each time step, the agent (AP) first observes the processed H(\(P_{col}\)) to perceive the current network state (s). The current network (Q-Network) uses the \(\epsilon\)-greedy method to select an action (a) in the current state (s). The agent (AP) calculates the new \(CW_{comp}\) based on the selected action and starts the channel access process according to Algorithm 1. All stations update their respective CW sizes according to \(CW_{comp}\).
After a transmission period, the agent (AP) evaluates the network performance of this period, calculates the reward r (normalized throughput) and the next state \(s'\), and then stores the experience sample (s, a, r, \(s'\)) of this period in the experience pool. Then, a mini-batch of samples d is drawn from the experience pool D to compute the temporal-difference (TD) target for each sample. Gradient descent is used to update the parameters \(\theta\) by minimizing the loss function L(\(\theta\)). Then, the target network is updated: every C steps, the parameters of the evaluation network are copied to the target network so that the two remain consistent. When the training phase is completed, the second phase, the evaluation phase, begins. During the remaining rounds, the agent only needs to observe the state and obtain actions from the trained model; it no longer needs rewards, as it is considered fully trained and has stopped receiving updates.
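
The building blocks of this loop (\(\epsilon\)-greedy selection, the experience pool, and the TD target) can be sketched in plain Python without any deep learning framework. This is an illustrative sketch, not the PaddlePaddle-based implementation: all names are hypothetical, the Q-values are passed in as plain lists, and the mapping from an action index to \(CW_{comp}\) in steps of 128 is an assumption inferred from the \(CW_{comp}\) values reported in Section 5.2.

```python
import random
from collections import deque

CW_COMP_STEP = 128  # assumption: spacing of the reported CW_comp values


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Index of the largest Q-value for the current state.
    return max(range(len(q_values)), key=q_values.__getitem__)


def action_to_cw_comp(action):
    """Assumed mapping from an action index to a CW compensation value."""
    return action * CW_COMP_STEP


class ReplayBuffer:
    """Fixed-size experience pool D holding (s, a, r, s') samples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest samples are dropped first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Random mini-batch; smaller than batch_size while the pool is filling.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, gamma):
    """TD target: reward plus the discounted best Q-value of the next state."""
    return reward + gamma * max(next_q_values)


# Hypothetical usage with made-up Q-values from the target network.
pool = ReplayBuffer(capacity=20_000)
pool.store((0.0, 0.15), 1, 0.75, (0.15, 0.10))
action = select_action([0.1, 0.9, 0.3], epsilon=0.0)
target = td_target(0.8, [0.2, 0.5], gamma=0.9)
```

In a full implementation the Q-value lists would be produced by the evaluation and target networks, and the TD targets of a mini-batch would feed the loss L(\(\theta\)).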

Algorithm 2 The Reinforcement Learning Process of the PDCF-DRL Algorithm

5 Simulation experiment and analysis

5.1 Simulation parameter settings

In order to verify the effectiveness of the PDCF-DRL scheme proposed in this paper, we simulated the Medium Access Control (MAC) method of IEEE 802.11. By utilizing the PaddlePaddle reinforcement learning framework provided by Baidu and the Python programming language, we implemented the PDCF-DRL Scheme. The network scenario of the experiment is shown in Fig. 4, including a single AP and multiple access stations with different AC types. It uses the IEEE 802.11ac standard with 256-QAM modulation, 5/6 coding rate, 80 MHz channel, 11 Mbps channel bit rate, slot time of 9 microseconds, SIFS time of 16 microseconds, packet payload of 8184 bits, MAC header of 288 bits, PHY header of 128 bits, and ACK of 112 bits plus the PHY header. Although the maximum transmission rate of 802.11ac can reach approximately 3.46 Gbps at 80 MHz, we set the channel transmission rate to 11 Mbps for the experiment. This choice allows for the rapid collection of network state information, providing valuable input for the DQN model. Furthermore, it is assumed that (1) the AP is aware of the current network state (i.e., the current network collision rate and throughput), (2) the AP periodically broadcasts the \(CW_{comp}\) value via beacon frames, and (3) each station sends data of a single AC type. The simulation is performed under an ideal channel with a zero Bit Error Ratio (BER) and all stations are stationary. The stations send packets at equal and constant rates, and a frame is generated immediately after each successful transmission or discard, ensuring that the stations always have data waiting to be sent.
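
As a rough sanity check on these timing parameters, the sketch below estimates the airtime of one successful data/ACK exchange. It assumes that every bit, including the PHY header, is sent at the 11 Mbps channel bit rate, and it ignores AIFS and backoff slots; the simulator used for the experiments may account for these differently. The constant and function names are illustrative.

```python
BIT_RATE_BPS = 11e6      # channel bit rate used in the experiment
SIFS_US = 16.0           # SIFS duration in microseconds
PAYLOAD_BITS = 8184
MAC_HEADER_BITS = 288
PHY_HEADER_BITS = 128
ACK_BITS = 112 + PHY_HEADER_BITS  # ACK body plus PHY header


def airtime_us(bits, bit_rate_bps=BIT_RATE_BPS):
    """Time in microseconds needed to send the given number of bits."""
    return bits / bit_rate_bps * 1e6


def successful_exchange_us():
    """Data frame, then SIFS, then ACK (AIFS and backoff are not included)."""
    data_bits = PAYLOAD_BITS + MAC_HEADER_BITS + PHY_HEADER_BITS
    return airtime_us(data_bits) + SIFS_US + airtime_us(ACK_BITS)


data_frame_us = airtime_us(PAYLOAD_BITS + MAC_HEADER_BITS + PHY_HEADER_BITS)
exchange_us = successful_exchange_us()
```

Under these assumptions a data frame occupies about 782 microseconds and a complete exchange about 820 microseconds.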

The key parameters of the DQN algorithm in the experiments are shown in Table 4. The network used by the algorithm is a fully connected DNN with three hidden layers, each with 128 nodes, and ReLU as the activation function. The detailed parameter explanations are as follows: Learn Frequency: The model parameters are updated every 5 time steps. This approach helps the model learn in a more stable state and avoids instability caused by frequent updates. Memory Size: The size of the experience replay buffer is set to store a maximum of 20,000 experience samples. This allows the agent to sample from past experiences, which helps break temporal correlations and improves learning efficiency. Batch Size: The batch size for random sampling from the experience replay during training is set to 32. A smaller batch size can enhance the model’s generalization ability and make the training process more stable. Learning Rate (\(\alpha\)): Set to 0.001, this parameter determines the magnitude of the Q-value adjustments during each update. A smaller learning rate can lead to a smoother learning process, but it may slow down convergence. Epsilon Greedy: Set to 0.1, which means the agent has a 10% chance to randomly select an action and a 90% chance to choose the current optimal action. This balances the relationship between exploring new strategies and utilizing learned strategies. Epsilon Greedy Decrement: The \(\varepsilon\) value is decreased by \(1 \times 10^{-6}\) with each iteration. This gradual reduction allows the agent to slowly decrease its exploration frequency during learning, thus focusing more on exploiting its acquired knowledge. \(Step_{\text {max}}\): The maximum allowed time steps per episode is set to 200, which limits the length of each episode. \(Episode_{\text {max}}\): The maximum number of episodes is set to 50, which controls the total duration of training and ensures that training is completed in a reasonable time frame.
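
The settings listed above can be collected in a single configuration, together with the resulting exploration schedule. The dictionary keys and the helper function are hypothetical names, and the sketch assumes that the \(\varepsilon\) decrement is applied once per time step and that \(\varepsilon\) is not allowed to fall below zero.

```python
DQN_PARAMS = {
    "learn_freq": 5,            # update the model every 5 time steps
    "memory_size": 20_000,      # capacity of the experience replay buffer
    "batch_size": 32,
    "learning_rate": 1e-3,
    "epsilon": 0.1,             # initial exploration probability
    "epsilon_decrement": 1e-6,  # assumed to be applied once per time step
    "max_steps": 200,           # time steps per episode
    "max_episodes": 50,
}


def epsilon_after(steps, params=DQN_PARAMS, floor=0.0):
    """Exploration probability after a number of time steps (linear decay)."""
    return max(floor, params["epsilon"] - steps * params["epsilon_decrement"])


total_steps = DQN_PARAMS["max_steps"] * DQN_PARAMS["max_episodes"]
learn_updates = total_steps // DQN_PARAMS["learn_freq"]
final_epsilon = epsilon_after(total_steps)
```

With these values a full run covers 10,000 time steps and 2,000 learning updates, and under the per-step assumption \(\varepsilon\) only decays from 0.1 to about 0.09.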

In the experiment, the access parameters for the different ACs are shown in Table 5. The CW range of each AC is dynamically adjusted by the \(CW_{comp}\) value, and the initial value of \(CW_{comp}\) is set to 0. Other access parameters, such as AIFS, DIFS, and \(TXOP_{limit}\), are determined according to the rules defined or set in the EDCA mechanism. Finally, because using DRL to sense the current network conditions and explore a suitable CW compensation value incurs significant overhead, exploration is conducted once per DCF period. During a DCF period, \(CW_{thr}\) can be recalculated multiple times due to its low computational overhead, allowing for better adaptation to current network conditions. In our experiments, a DCF period is defined as 1 min. \(CW_{thr}\) is recalculated after each data transmission is completed and is then used to adjust the CW value.

Fig. 4 The model diagram of the proposed PDCF-DRL scheme

Table 4 DQN algorithm parameters
Table 5 Access categories and related parameters

5.2 Simulation and analysis

The PDCF-DRL scheme uses a DRL technique where the input to the DQN model is the historical collision rate and the output is the adjusted \(CW_{comp}\) value. The reward function is designed as a normalized throughput. The model stores historical states, actions, and reward values in an experience pool for updating and optimizing during training. To validate the model’s perceptual ability under different network conditions, we designed an experiment to explore the variation of the \(CW_{comp}\) value and its impact on the collision rate and throughput under different station densities. The experimental setup is as follows: the number of learning rounds is 50, and the total number of competing stations ranges from 20 to 120 in steps of 20, with the stations divided equally among the AC types to simplify experimentation and analysis.

With a total of 20 competing stations, the collision rate during certain exploration rounds has increased significantly, leading to a substantial decrease in throughput. The compensation values for 50 rounds of exploration are shown in Fig. 5, while the corresponding collision rates and throughput are shown in Figs. 6 and 7, respectively. In exploration rounds 21, 25, 27, and 28, the \(\text {CW}_{\text {comp}}\) value is 0, resulting in collision rates of approximately 56.26%, 57.46%, 64.34%, and 61.26%, respectively. The corresponding normalized throughput also decreases significantly, with approximately 38.54%, 37.02%, 35.45%, and 36.45%. This is because when the \(\text {CW}_{\text {comp}}\) value is 0, the CW range for AC_VO stations is (7, 15) and for AC_VI stations is (15, 31). With 5 AC_VO and 5 AC_VI stations out of the 20 stations, the small CW ranges lead to a high probability of collisions, severely reducing throughput. On the other hand, a \(\text {CW}_{\text {comp}}\) value of 768 during exploration rounds 4, 8, 11, and 23 corresponds to normalized throughput values of approximately 66.51%, 64.46%, 65.84%, and 67.26%, respectively. These significant reductions in throughput are due to the excessively large \(\text {CW}_{\text {comp}}\) value, which greatly increases the waiting latency for stations to send data. As the number of exploration rounds increases, the explored \(\text {CW}_{\text {comp}}\) value gradually stabilizes and converges to the value of 128 at round 33, which corresponds to the normalized throughput stabilizing between 79.37% and 88.04%, and the collision rate also stabilizes between 0.26% and 6.64%.

Fig. 5 \(CW_{comp}\) of 20 stations

Fig. 6 Collision rate of 20 stations

Fig. 7 Throughput of 20 stations

Fig. 8 \(CW_{comp}\) of 40 stations

Fig. 9 Collision rate of 40 stations

Fig. 10 Throughput of 40 stations

With a total of 40 competing stations, the collision rate for some exploration rounds is notably high, resulting in a significant decrease in corresponding throughput. The compensation values for 50 rounds of exploration are shown in Fig. 8, while the corresponding collision rates and throughput are shown in Figs. 9 and 10, respectively. Specifically, for exploration rounds 8, 13, 17, 18, 19, 20, 25, and 28, the \(CW_{comp}\) value is 0. During these rounds, the collision rates sharply increase to approximately 64.31%, 60.52%, 66.42%, 62.66%, 68.71%, 66.45%, 60.58%, and 60.49%, respectively. Correspondingly, the normalized throughput decreases significantly to about 34.25%, 35.71%, 33.32%, 33.81%, 32.91%, 33.34%, 33.32%, and 35.60%, respectively. This can be attributed to the fact that when the \(CW_{comp}\) value is 0, the increase in the number of competing stations leads to more serious collisions between AC_VO and AC_VI stations, while the collision rate for AC_BE and AC_BK stations also increases rapidly. In exploration rounds 20 and 23, the explored \(CW_{comp}\) value is 896. Although the collision probabilities are approximately 2.64% and 5.62%, respectively, the waiting delays for the stations increase significantly, resulting in a notable decrease in throughput to about 52.69% and 54.05%. As the number of exploration rounds increases, the \(CW_{comp}\) value stabilizes and converges to 256 by round 34. This stabilization corresponds to normalized throughput stabilizing between 80.76% and 83.95%, and the collision rate stabilizes between 1.28% and 10.63%. Under light network load, the smaller number of competing stations results in a lower collision rate. Consequently, the DQN model tends to select a smaller \(CW_{comp}\) to minimize station waiting delays, while maintaining a low collision rate. This allows stations to transmit data more quickly, enhancing overall network throughput and yielding higher rewards.

With a total of 60 competing stations, different \(CW_{comp}\) values result in significant variations in collision rate and throughput. The compensation values for 50 rounds of exploration are shown in Fig. 11, while the corresponding collision rates and throughput are shown in Figs. 12 and 13, respectively. Specifically, in exploration rounds 1, 8, 18, 23, 27, and 30, the explored \(CW_{comp}\) value is 0. This corresponds to a substantial increase in the collision rate, approximately 66.25%, 68.46%, 74.63%, 72.28%, 33.43%, and 70.76%, respectively. Consequently, the normalized throughput decreases significantly to around 33.48%, 30.84%, 27.38%, 28.87%, 27.56%, and 29.71%, respectively. In exploration rounds 11 and 15, the explored \(CW_{comp}\) is 128, resulting in a noticeable decrease in collision rates to 32.67% and 28.65%, respectively. The corresponding normalized throughput increases to approximately 50.91% and 54.66%, respectively. In exploration rounds 32 and 33, the explored \(CW_{comp}\) is 896, corresponding to collision rates of about 0.26% and 6.04% and normalized throughput of approximately 71.81% and 70.91%. Although increasing the \(CW_{comp}\) value effectively reduces the collision rate, an excessively high \(CW_{comp}\) value increases transmission delays, adversely affecting network performance and making throughput improvements less effective. As the number of exploration rounds increases, the explored \(CW_{comp}\) gradually stabilizes and converges to 384 by round 37. This corresponds to normalized throughput stabilizing between 78.81% and 84.17% and collision rates stabilizing between 0.46% and 13.25%. Both collision rate and station waiting delay are crucial factors affecting throughput, so the DQN model must balance these to achieve optimal network performance.

Fig. 11 \(CW_{comp}\) of 60 stations

Fig. 12 Collision rate of 60 stations

Fig. 13 Throughput of 60 stations

Fig. 14 \(CW_{comp}\) of 80 stations

Fig. 15 Collision rate of 80 stations

Fig. 16 Throughput of 80 stations

With a total of 80 competing stations, exploration tends to favor larger \(CW_{comp}\) values. The compensation values for 50 rounds of exploration are shown in Fig. 14, while the corresponding collision rates and throughput are shown in Figs. 15 and 16, respectively. In exploration round 9, the explored \(CW_{comp}\) is 128, resulting in a significantly higher collision rate of approximately 38.25% and a corresponding decrease in normalized throughput to about 51.77%. As the network load increases, the DQN model prefers exploring larger \(CW_{comp}\) values to reduce the collision rate. For example, in exploration round 1, a \(CW_{comp}\) value of 896 corresponds to a collision rate of about 2.34% and a normalized throughput of approximately 72.72%. Although such a large \(CW_{comp}\) still limits the throughput gain, the throughput is higher than that obtained with a \(CW_{comp}\) value of 896 under lightly loaded network conditions. As the number of exploration rounds increases, the explored \(CW_{comp}\) gradually stabilizes and converges to 512 by round 37. This corresponds to a stable normalized throughput ranging from 78.91% to 82.24% and a stable collision rate ranging from 0.67% to 13.25%.

Fig. 17 \(CW_{comp}\) of 100 stations

Fig. 18 Collision rate of 100 stations

Fig. 19 Throughput of 100 stations

Fig. 20 \(CW_{comp}\) of 120 stations

Fig. 21 Collision rate of 120 stations

Fig. 22 Throughput of 120 stations

When the total number of competing stations reaches 100, exploration again tends to favor larger \(CW_{comp}\) values. The compensation values for 50 rounds of exploration are shown in Fig. 17, while the corresponding collision rates and throughput are shown in Figs. 18 and 19, respectively. During exploration rounds 13 and 36, the explored \(CW_{comp}\) is 128, which leads to a significant increase in the collision rate to about 40.26% and 45.64%, respectively. Correspondingly, the normalized throughput is significantly reduced to around 50.94% and 49.92%, respectively. In exploration rounds 4, 6, 17, 27, and 32, the explored \(CW_{comp}\) is 384. As the CW range of each AC increases, the corresponding collision rate decreases to some extent, reaching approximately 22.67% and 18.65%. The corresponding normalized throughput also improves, reaching 66.36%, 69.09%, 67.27%, 67.39%, and 65.45%, respectively. As the number of exploration rounds increases, the explored \(CW_{comp}\) gradually stabilizes and converges to 640 by round 37. This corresponds to a stable normalized throughput ranging from 77.27% to 84.45% and a stable collision rate between 2.89% and 12.38%.

With a total of 120 competing stations, the explored \(CW_{comp}\) also tends toward larger values. The compensation values for 50 rounds of exploration are shown in Fig. 20, while the corresponding collision rates and throughput are shown in Figs. 21 and 22, respectively. In exploration rounds 1 and 2, the explored \(CW_{comp}\) is 0, which corresponds to the highest collision rates, about 96.24% and 87.65%, and the lowest normalized throughput, about 2.36% and 4.18%, respectively. In exploration rounds 9, 14, 27, and 32, the explored \(CW_{comp}\) is 384. With the increase in \(CW_{comp}\), the collision rate decreases to a certain extent but remains very high, at 33.34%, 36.28%, 35.66%, and 38.42%, respectively. The corresponding normalized throughput also improves, reaching 65.29%, 64.12%, 64.05%, and 63.06%, respectively. As the number of exploration rounds increases, the explored \(CW_{comp}\) gradually stabilizes and converges to 768 by round 34, which corresponds to a stable normalized throughput between 76.81% and 83.36% and a stable collision rate between 4.38% and 12.56%.
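The converged \(CW_{comp}\) values reported for the five station counts can be tabulated to make the trend explicit. The short sketch below only rearranges numbers already given in the text; the variable names are illustrative. It suggests that, over these five data points, the converged value grows by one step of 128 for every 20 additional stations, which is an observation about the reported results rather than a general rule.

```python
# Converged CW_comp values reported above, keyed by number of competing stations.
converged_cw_comp = {40: 256, 60: 384, 80: 512, 100: 640, 120: 768}

# Ratio of converged CW_comp to station count for each reported scenario.
ratios = {n: cw / n for n, cw in converged_cw_comp.items()}

# Change in converged CW_comp between consecutive scenarios (20 stations apart).
counts = sorted(converged_cw_comp)
steps = [converged_cw_comp[b] - converged_cw_comp[a] for a, b in zip(counts, counts[1:])]

for n in counts:
    print(f"{n:3d} stations: CW_comp = {converged_cw_comp[n]:3d}, ratio = {ratios[n]:.1f}")
print("steps between scenarios:", steps)
```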

The experimental analysis reveals that during the early stages of exploration, the DQN focuses on stochastic exploration. It stores processed historical collision rates (observed states), selected actions, and normalized throughput (reward values) in the experience pool and uses them to adjust and optimize its selection strategy. In the later stages, as the model training progresses, the DQN consistently selects the optimal action to determine the optimal \(CW_{comp}\) value. Consequently, with an increasing number of learning rounds, the \(CW_{comp}\) value gradually stabilizes and converges, exhibiting no significant fluctuations. Moreover, the converged \(CW_{comp}\) value tends to increase with the number of competing stations. When the network load is light and the number of competing stations is small, the model converges to a smaller \(CW_{comp}\) value. This results in a relatively narrow CW range for each AC queue, preventing excessive packet transmission delays in a well-performing network. Conversely, when the network load is heavy and the number of competing stations is large, the model converges to a larger \(CW_{comp}\) value. This adjustment reduces the likelihood of inter-station collisions in a congested network environment. Thus, the algorithm dynamically adapts to varying network conditions, maintaining good performance even in high-density station scenarios.

To verify the effectiveness of the PDCF-DRL scheme in guaranteeing QoS, experiments were conducted to analyze the average waiting delay and average throughput of different AC queues. In these experiments, the number of competing stations ranged from 20 to 140 (with equal numbers of stations for different service types). The corresponding average waiting delays for different AC queues are shown in Fig. 23. For low-priority services (AC_BE, AC_BK), the channel access delay is higher. When the number of stations is 20, the average waiting delays for AC_BE and AC_BK are approximately 0.00129493 s and 0.00136485 s, respectively. As the number of stations increases, the channel access delay for low-priority services increases significantly. At 120 stations, the average waiting delays for AC_BE and AC_BK are about 0.00348781 s and 0.00366739 s, respectively. High-priority services (AC_VO and AC_VI) experience smaller channel access delays. At 20 stations, the average waiting delays for AC_VO and AC_VI are about 0.00020793 s and 0.00036692 s, respectively. Although an increase in the number of stations leads to more channel contention, high-priority services maintain their QoS at the expense of the channel access delay of low-priority services. For 120 stations, the average waiting delays for AC_VO and AC_VI are approximately 0.000825778 s and 0.000883536 s, respectively.

Throughput, the amount of data successfully transmitted per unit time, reflects the efficiency and rate of data transmission in the network. The normalized throughput for different AC queues is shown in Fig. 24. The results indicate that the normalized throughput of the proposed PDCF-DRL scheme does not decrease dramatically with an increasing number of competing stations. With 20 competing stations, the normalized throughput values for AC_VO, AC_VI, AC_BE, and AC_BK are about 22.52%, 22.26%, 20.91%, and 20.72%, respectively. When the number of competing stations increases to 120, the normalized throughput values for AC_VO, AC_VI, AC_BE, and AC_BK are approximately 20.59%, 20.38%, 18.99%, and 18.36%, respectively, representing reductions of 1.93, 1.88, 1.92, and 2.36 percentage points, respectively. The PDCF-DRL scheme proposed in this paper reduces the collision rate by increasing the CW range (\(CW_{comp}\) value) for AC queues under high-load scenarios and employs an adaptive backoff strategy to reduce station waiting delays. Consequently, the algorithm effectively guarantees the throughput of different AC queues under high-load conditions. Experimental results demonstrate that the PDCF-DRL scheme performs well in terms of average waiting delay and average throughput for different services, effectively distinguishing service types and providing appropriate QoS guarantees for each service type.
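The per-AC throughput reductions quoted above are absolute differences in percentage points. As a worked check, the sketch below recomputes them from the reported values and also derives the relative reduction for each queue; the relative figures are not stated in the text and are shown only for illustration, and the function and variable names are hypothetical.

```python
# Normalized throughput (%) per AC queue at 20 and 120 stations, from the text above.
throughput_20 = {"AC_VO": 22.52, "AC_VI": 22.26, "AC_BE": 20.91, "AC_BK": 20.72}
throughput_120 = {"AC_VO": 20.59, "AC_VI": 20.38, "AC_BE": 18.99, "AC_BK": 18.36}


def throughput_drops(before, after):
    """Return {AC: (drop in percentage points, relative drop in percent)}."""
    drops = {}
    for ac, value in before.items():
        drop_pp = value - after[ac]          # absolute drop, percentage points
        drop_rel = 100.0 * drop_pp / value   # drop relative to the 20-station value
        drops[ac] = (drop_pp, drop_rel)
    return drops


drops = throughput_drops(throughput_20, throughput_120)
for ac, (drop_pp, drop_rel) in drops.items():
    print(f"{ac}: {drop_pp:.2f} percentage points ({drop_rel:.1f}% relative)")
```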

Fig. 23 Average waiting delay corresponding to different ACs

Fig. 24 Normalized throughput corresponding to different ACs

Finally, the total collision rate and total throughput of the traditional CSMA mechanism, the EDCA mechanism, the deep reinforcement learning-based CCOD-DQN [10] and SETL-DQN [12] schemes, and the PDCF-DRL scheme proposed in this paper are compared for different station densities. To conduct a detailed comparative analysis, we categorize the network scenarios into two types: single traffic scenarios and mixed traffic scenarios. In the single traffic scenario, all stations send only a single type of data traffic.

Under AC_VO traffic, the collision rates and throughput for all schemes are shown in Figs. 25 and 26. The traditional CSMA and EDCA schemes perform poorly in terms of collision rates and throughput. With 20 stations, their collision rates are 63.92% and 64.28%, respectively, with corresponding throughput of 33.82% and 32.18%. With 120 stations, the collision rates are 83.75% and 84.62%, with throughput of 13.64% and 13.78%. Furthermore, the differences in collision rates and throughput between CSMA and EDCA are minimal across station numbers, primarily because both use the BEB (Binary Exponential Backoff) mechanism in their backoff strategies and differ mainly in AIFS times and TXOP. As the number of stations increases, competition in the network intensifies, leading to a significant rise in collisions between stations. The backoff strategies of EDCA and CSMA cannot effectively alleviate these collisions, so the two perform similarly in the experiments. With 20 stations, the DRL-based CCOD-DQN and SETL-DQN schemes exhibit collision rates of 46.42% and 47.25%, respectively, and corresponding throughput of 48.56% and 47.72%. With 120 stations, their collision rates are 74.65% and 76.81%, and their throughput values are 22.41% and 23.64%. Although these schemes employ deep reinforcement learning techniques and improved backoff strategies, they do not significantly reduce collision rates or increase throughput. This is primarily because the contention window range for AC_VO is (7,15), and neither CCOD-DQN nor SETL-DQN adjusts the contention window range for different ACs. The combination of a small contention window and a large number of stations leads to severe collisions, resulting in a drastic increase in collision rates and a rapid decline in throughput. In contrast, the proposed PDCF-DQN scheme performs well, with collision rates of 6.89% and 16.38% for 20 and 120 stations, respectively, and corresponding throughput of 84.98% and 76.25%. This improvement is attributed to the PDCF-DQN scheme’s ability to dynamically adjust the contention window range for ACs using CW compensation values. When the number of stations is high, it increases the contention window range for AC_VO and employs an adaptive backoff strategy to reduce collisions, thereby maintaining good performance.
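The effect of a small contention window with many stations can be illustrated with a simplified one-shot model. The sketch below assumes that every station draws a backoff counter uniformly from 0 to CW at the same instant and that the first transmission collides whenever the smallest counter is shared by two or more stations. It ignores AIFS, counter freezing, retries, and window adjustment, so it is not the simulation used in the experiments and is not expected to reproduce the reported figures; the function name is hypothetical.

```python
def first_slot_collision_probability(n_stations: int, cw: int) -> float:
    """Probability that the smallest backoff counter is drawn by more than one station.

    Simplifying assumption: each station draws independently and uniformly
    from {0, ..., cw}, and all stations start counting down together.
    """
    if n_stations < 2:
        return 0.0
    slots = cw + 1
    # P(unique minimum) = sum over k of
    #   N * P(one given station draws k) * P(all other stations draw > k)
    p_unique_min = sum(
        n_stations * (1.0 / slots) * ((cw - k) / slots) ** (n_stations - 1)
        for k in range(slots)
    )
    return 1.0 - p_unique_min


# Compare the AC_VO window bounds quoted above with a much larger window.
for cw in (7, 15, 1023):
    for n in (20, 120):
        p = first_slot_collision_probability(n, cw)
        print(f"CW={cw:4d}, stations={n:3d}: collision probability {p:.3f}")
```

Under this toy model, widening the window lowers the chance that several stations share the smallest counter, which is the qualitative effect the CW compensation value is meant to exploit.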

Fig. 25 Collision rates for AC_VO

Fig. 26 Throughput for AC_VO

In the AC_VI traffic scenario, the collision rates and throughput for each scheme are shown in Figs. 27 and 28. The traditional CSMA and EDCA schemes again perform poorly in terms of collision rates and throughput. With 20 stations, the collision rates are 49.26% and 48.83%, respectively, with corresponding throughput of 47.85% and 48.48%. For 120 stations, the collision rates are 80.24% and 81.26%, while the throughput values are 17.32% and 16.88%. With 20 stations, the collision rates for the DRL-based CCOD-DQN and SETL-DQN schemes are 26.21% and 27.13%, respectively, with corresponding throughput of 67.33% and 66.41%. For 120 stations, the collision rates are 60.89% and 59.83%, with throughput of 36.41% and 36.86%. Because the contention window range for AC_VI is only (15,31) and these schemes improve only the CW adjustment strategy, their performance gains are limited. In contrast, the proposed PDCF-DQN scheme performs well. With 20 and 120 stations, the collision rates are 4.96% and 14.76%, respectively, with corresponding throughput of 85.86% and 76.32%. The PDCF-DQN scheme dynamically adjusts the contention window range for AC_VI using CW compensation values, increasing the contention window for AC_VI when the number of stations is high, thereby reducing the likelihood of collisions and maintaining good performance.

Fig. 27 Collision rates for AC_VI

Fig. 28 Throughput for AC_VI

In the AC_BE traffic scenario, the collision rates and throughput for each scheme are shown in Figs. 29 and 30. Because the contention window range of AC_BE is (31,1023), collision rates decrease significantly and throughput increases correspondingly across the schemes. However, the traditional CSMA and EDCA schemes still perform poorly in terms of both collision rates and throughput. With 20 stations, the collision rates are 28.26% and 28.83%, respectively, with corresponding throughput of 64.81% and 64.47%. For 120 stations, the collision rates are 46.08% and 47.16%, while the throughput values are 48.46% and 47.98%. With 20 stations, the DRL-based CCOD-DQN [10] and SETL-DQN [12] schemes exhibit collision rates of 9.85% and 13.46%, respectively, with corresponding throughput of 80.67% and 81.26%. With 120 stations, the collision rates are 17.42% and 16.53%, and the throughput values are 78.32% and 78.26%. These schemes use DRL to explore and adjust the CW based on network conditions, significantly enhancing network performance. The proposed PDCF-DQN scheme also performs well, with collision rates of 4.84% and 10.76% for 20 and 120 stations, respectively, and corresponding throughput of 88.75% and 81.32%.

Fig. 29 Collision rates for AC_BE

Fig. 30 Throughput for AC_BE

In the AC_BK traffic scenario, the collision rates and throughput for each scheme are shown in Figs. 31 and 32. Because the contention window range of AC_BK is (31,1023), collision rates decrease significantly and throughput increases correspondingly across the schemes. However, the traditional CSMA and EDCA schemes still perform poorly in terms of both collision rates and throughput. With 20 stations, the collision rates are 27.39% and 29.53%, respectively, with corresponding throughput of 67.32% and 66.82%. For 120 stations, the collision rates are 47.62% and 48.23%, while the throughput values are 46.71% and 46.39%. With 20 stations, the DRL-based CCOD-DQN [10] and SETL-DQN [12] schemes exhibit collision rates of 11.03% and 10.51%, respectively, with corresponding throughput of 81.38% and 80.25%. With 120 stations, the collision rates are 18.81% and 18.03%, and the throughput values are 77.23% and 76.84%. As under AC_BE traffic, these DRL-based schemes significantly reduce collision rates and maintain stable throughput. The proposed PDCF-DQN scheme also performs well, with collision rates of 3.04% and 9.31% for 20 and 120 stations, respectively, and corresponding throughput of 89.79% and 84.26%.

Fig. 31 Collision rates for AC_BK

Fig. 32 Throughput for AC_BK

To reflect real network conditions more closely, the second scenario features multiple AC traffic types, with equal numbers of stations assigned to each of the four AC types, each transmitting its corresponding traffic. Overall collision rates and throughput are shown in Figs. 33 and 34, respectively. The traditional CSMA and EDCA schemes exhibit the highest collision rates and the lowest throughput. With 20 stations, their collision rates are 47.28% and 45.46%, respectively, corresponding to throughput values of 41.82% and 43.64%. When the number of stations increases to 120, the collision rates rise to 82.52% and 81.47%. These schemes use the traditional BEB backoff algorithm to adjust the contention window, which struggles to adapt to changing network conditions, leading to significant performance degradation. The DRL-based CCOD-DQN [10] and SETL-DQN [12] schemes also show a noticeable downward trend in performance. With 20 stations, their collision rates are 20.39% and 24.21%, respectively, with corresponding throughput values of 64.84% and 63.96%. When the number of stations increases to 120, the collision rates are 62.14% and 60.91%, with throughput values of 31.71% and 32.12%. Although these schemes employ DRL to adjust the contention window based on real-time network conditions, their performance is significantly limited by the excessively small contention windows for AC_VI and AC_VO. In contrast, the proposed PDCF-DQN scheme achieves collision rates of 2.35% and 12.29%, with corresponding normalized throughput values of 83.45% and 81.37%, for 20 and 120 stations, respectively. The proposed PDCF scheme adjusts the CW range of each AC using CW compensation values and employs an adaptive backoff strategy, allowing it to handle both single traffic and complex mixed traffic scenarios.

Fig. 33 Collision rates for multiple AC

Fig. 34 Throughput for multiple AC

5.3 Summary and analysis

The CSMA/CA protocol uses the Binary Exponential Backoff (BEB) algorithm to adjust the CW, effectively reducing collisions in small-scale networks under light loads. However, as network size and load increase, collision rates surge and throughput decreases. The EDCA mechanism, which aims to provide good QoS support for real-time information services, defines four ACs and corresponding access rules to meet diverse service requirements. However, the EDCA mechanism stipulates that the CW of each priority service can only start from \(CW_{min}\) and adopts the BEB backoff algorithm to adjust the CW. Because the \(CW_{max}\) for high-priority services is also relatively small, severe collisions occur when there are many stations, which makes it challenging to provide QoS guarantees for real-time multimedia services. The CCOD-DQN [10] scheme leverages DRL to predict the optimal CW values, aiming to enhance Wi-Fi network throughput and reduce collision rates under high loads. However, its reliance on the BEB algorithm for CW adjustment makes it difficult to adapt to dynamic network changes, leading to increased transmission latency and decreased throughput. In contrast, the SETL-DQN [12] scheme uses deep reinforcement learning to determine appropriate CW thresholds and adjusts the CW using the SETL algorithm based on network load conditions. It employs an exponential adjustment strategy in light-load networks and a linear adjustment in heavy-load networks to reduce collision rates and increase throughput. Although deep reinforcement learning-based solutions significantly enhance Wi-Fi network performance, they often overlook the differentiation of service types, potentially compromising QoS in scenarios with multiple services. To address this issue, this paper proposes the PDCF-DRL scheme. By leveraging deep reinforcement learning, this scheme dynamically adjusts the CW ranges corresponding to different service types based on current network conditions. It also adopts an adaptive backoff algorithm to optimize the station backoff process, thereby ensuring QoS and maximizing network performance (Tables 6, 7, 8).
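The limitation of BEB for high-priority queues discussed above can be seen by listing the CW values it produces after successive collisions. The following minimal sketch uses the common doubling rule CW ← min(2(CW + 1) − 1, \(CW_{max}\)) and the window bounds quoted earlier for AC_VO, (7,15), and for AC_BE/AC_BK, (31,1023); the function name is hypothetical, and the reset to \(CW_{min}\) after a successful transmission is omitted.

```python
from typing import List


def beb_sequence(cw_min: int, cw_max: int, collisions: int) -> List[int]:
    """Return the CW values seen after 0, 1, ..., `collisions` consecutive collisions."""
    cw = cw_min
    sequence = [cw]
    for _ in range(collisions):
        # Double the window size (in slots) after each collision, capped at cw_max.
        cw = min(2 * (cw + 1) - 1, cw_max)
        sequence.append(cw)
    return sequence


print("AC_VO (7, 15):   ", beb_sequence(7, 15, 3))
print("AC_BE (31, 1023):", beb_sequence(31, 1023, 6))
```

With bounds of (7,15), the window is already at its maximum after a single collision, so further collisions cannot spread the stations any further apart, whereas the (31,1023) range leaves room for five doublings.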

Table 6 Mechanism characteristics (1)
Table 7 Mechanism characteristics (2)
Table 8 Comparative analysis of different schemes (Apply mode)

6 Conclusion

This paper proposes a deep reinforcement learning-based contention window backoff algorithm that differentiates network services. By applying deep reinforcement learning on top of the EDCA mechanism, the algorithm explores a CW compensation value suited to the current network environment and uses it to update the contention window ranges of the AC queues. In addition, a novel adaptive backoff algorithm adopts different backoff strategies according to the network congestion level, partially mitigating the performance challenges faced by Wi-Fi technology. Experimental results demonstrate that the algorithm distinguishes between different types of services and offers significant performance advantages in improving network throughput and reducing collision rates. Future research could extend the approach to more complex network topologies and service scenarios to further improve the algorithm’s applicability and performance.