1. Introduction
Immersion and realism are crucial elements for games that aim to provide players with enjoyment [1]. The achievement of immersion and realism requires the collaboration of various in-game components, one of which is the integral audio system. Within the sound system of a game, in-game audio cues play a fundamental and influential role. Audio cues are sounds that provide information about the game state, such as the location, distance, and movement of enemies, allies, and objects [2]. Previous research has shown that they are effective in enhancing player abilities [2,3] and contribute to a heightened sense of immersion during gameplay [4,5]. This is particularly noticeable in first-person shooter (FPS) games, where audio cues are instrumental in helping to accurately identify potential threats [2].
Audio perception in games is a controversial topic. In 2018, a study on audio perception in virtual reality (VR) environments stated that user perception of audio in such environments is not as important as visual perception [6]. In contrast, other studies have shown that audio has a high level of importance in VR environments. For example, Eames [7] highlighted that the design of audio in VR environments can deliberately guide the player’s attention to achieve a better narrative effect. Lucas et al. [8] designed a VR audio game with a free-motion interface specifically for mobile phones, and their participants generally found the audio design of this game to be an enjoyable and challenging activity. In addition, another study showed that designing or listening to music in the right environments can give users a better sense of immersion [9]. Considering these studies, we infer that audio design can have different effects, even in the same environments, with some presenting a positive effect of audio design and others showing that audio is not particularly helpful.
However, some studies did not show how changes in audio loudness affect players and therefore cannot fully reflect the effect of audio in games. In contrast, in our previous study [10], we used Gaussian process regression (GPR) to balance player performance with data from test players, setting game difficulty according to designer preferences. We revealed the relationship between player performance and in-game audio cue volume in an FPS game and developed a dynamic difficulty adjustment (DDA) method for personalized audio volume control. However, in the game domain, if the player’s experience is subpar, we cannot claim that our method is justified in practical applications, even though it successfully controls difficulty [11]. Therefore, this paper extends our previous study by investigating player experience under our proposed DDA method.
In summary, this paper primarily addresses the following research questions (RQs):
- RQ 1:
Can players have a positive gaming experience when a DDA mechanism is implemented through game volume?
In this paper, we use questions taken from the Game User Experience Satisfaction Scale (GUESS) questionnaire [12] to investigate in detail whether or not players will have a more positive gaming experience with the DDA mechanism proposed in our previous study.
- RQ 2:
How does the game volume affect the player’s gaming experience?
To better understand and analyze the player experience, we asked players to share with us their feelings and opinions in writing, using their native languages. We anticipate that these kinds of open-ended responses may consist of multiple languages, which means it may be difficult to find qualified human evaluators. Therefore, we used a state-of-the-art large language model (LLM) to help us in the language analysis task. Although LLMs have proven powerful in analyzing natural languages [13], they sometimes hallucinate [14]. We alleviated this issue by designing a procedure that enables an LLM to reliably help a small number of human evaluators classify player responses.
Our study makes three notable contributions. Firstly, we investigate whether DDA through in-game audio cue volume setting—using the aforementioned goal recommendation algorithm and the player performance modeled by GPR—can give the player a positive gaming experience. This work reveals that in-game audio cue volume, if appropriately designed, can give the player a positive gaming experience. Secondly, we identify what affects the player’s gaming experience in the FPS game in use, equipped with the aforementioned DDA mechanism. Finally, the proposed procedure for having an LLM participate in the player-response classification task with only a few human evaluators is effective and can be used as a reference for other applications.
3. Preliminary Study
Inspired by the popular topic in the game industry regarding the gap between designers’ intentions and players’ actual experiences, we aim to narrow this gap using the in-game mechanism called DDA. We targeted an important in-game element that might have been underestimated: in-game audio cues. Previous research [2] has indicated that adjusting audio cue volume can, to some extent, train players’ gaming skills in an FPS game, but the authors did not specify how to measure the volume. In our previous study [10], we applied the DDA mechanism exclusively to audio cue volume, clarifying our approach as shown below.
Firstly, it is important to understand that several measures are used to determine audio volume, including decibels, sound pressure levels, and loudness. In addition, we recognize that background noise and individual hearing sensitivity can affect the perceived loudness of a sound [39]. To control these variables, we need to standardize the players’ equipment and their external environment during gameplay and test their hearing abilities in advance. However, due to the inherent variability of real-world conditions, algorithms developed under controlled testing conditions may have limited practical applications. Therefore, our previous study aimed to achieve robust results even when players are in different environments and using their own devices, representing a more general case.
To achieve this goal, we controlled the most straightforward aspect of volume that game players can understand: the system volume, which is managed by an audio-tapered volume control. According to Microsoft’s documentation on audio-tapered volume controls (https://learn.microsoft.com/en-us/windows/win32/coreaudio/audio-tapered-volume-controls (accessed on 5 November 2024)), we understand that this method produces an approximately linear relationship between the volume setting and perceived loudness. Based on this information, we divided the control’s value range (0 to 100) into equal segments (0, 25, 50, 75, 100) to ensure a consistent perception of volume levels when increasing or decreasing the value by one segment. However, the decibel range of the volume controls depends on the audio endpoint device, meaning we cannot determine the exact decibel value for each player. This is not crucial for us, as our proposed DDA solution aims to be highly personalized. Regardless of different settings, environments, or gaming skills, we can provide an effective DDA solution tailored to each player.
In summary, both our previous and current studies manage the system volume, which we believe enhances the applicability of our method. The system volume value represents the percentage of the output volume of the audio endpoint device. As a simplified illustration, if a headset’s maximum output is 120 decibels, then when the system volume is set to 0, 50, and 100, the actual output in decibels will be 0, 60, and 120, respectively. Therefore, changing the volume in this paper refers to changing the amplitude of the audio signal, which is related to the time domain. Increasing or decreasing the volume means that the audio output device receives a stronger or weaker electrical signal, producing more powerful or less powerful sound waves.
From another perspective considered in our previous study, balancing designer preferences and player performance is also a challenging task. When designing a level, designers determine whether the current design is more difficult or easier, but players may not perceive it the same way. It is important to understand that the player population has a wide range of gaming skills. This means that even if the game designer considers an adjustment to be easy, some players may still find it difficult. Similarly, if the game designer considers an adjustment to be difficult, some players may find it not very challenging.
Considering all aforementioned concerns, we employed GPR, which can quickly and effectively adapt to players for rapid player modeling. In addition, we designed a new goal for our DDA mechanism that balances designer preferences and player performance. Furthermore, we selected a relatively objective in-game variable, the game clear time, to evaluate player performance. In Section 4.3, we detail the current metric for evaluating player performance. By assessing player performance, which can be influenced by changes in their gaming device or environment and result in different perceptions of audio loudness, we can demonstrate that our DDA can still quickly adapt to each player. Algorithm 1 describes the entire DDA process, referred to as GPR-DDA, and Equation (1) illustrates how we balance designer preference and player performance.
Algorithm 1: GPR-DDA
Input: …
Output: …
Initialize …
while … do
    …
end while
This algorithm shows that, given a goal, we use GPR to determine an audio setting that would make the current player’s performance close to that goal. In this study, the term “goal” refers to the degree of challenge. The designer’s preferences are represented by a line that maps audio cue settings to players’ performance; this line can be obtained by using prior data to calculate the average performance of test players. The current player’s data set consists of pairs of an audio setting i and the player’s performance at this setting. The GPR model returns the current player’s predicted performance, P, for a given audio setting. In this algorithm, C represents the respective performance of the current player and T represents the respective performance of prior data derived beforehand for each game level, either objectively from test players or subjectively from game designers. The subscripts Best and Worst, respectively, stand for their best and worst performance among all levels, based on both the actual performance at each seen level and the predicted performance by GPR for each unseen level. A further value is used to validate our model’s performance, where a value closer to zero indicates a better prediction.
The goal is determined by the method we used for balancing player performance and designer preference, which is called the goal recommendation algorithm (Equation (1)). By choosing the enemy audio cue volume setting based on this algorithm in each round, DDA is realized, aligning players’ performance with their recommended goals.
Equation (1) involves three coefficients, each ranging from 0 to 1: the first weights the current player’s best and worst performance, the second weights the best and worst performance from prior data, and the third is the weight between these two values. These coefficients give game designers the flexibility to emphasize the more important terms according to their preferences. This algorithm computes the goal by considering both player performance and designer preference to avoid the game being overly easy, which can lead to player boredom, or excessively challenging, which can lead to player frustration [25].
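The sketch below shows one possible form of such a goal computation, assuming the goal is a weighted blend of the current player’s best and worst performance (C) and the best and worst performance from prior data (T). The function name `recommend_goal` and the coefficient names `alpha`, `beta`, and `gamma` are hypothetical, and the actual Equation (1) may differ.

```python
def recommend_goal(c_best, c_worst, t_best, t_worst,
                   alpha=0.5, beta=0.5, gamma=0.5):
    """Blend player performance (C) and prior data (T) into one goal value.

    Hypothetical form for illustration; not necessarily Equation (1).
    alpha weights the player's best vs. worst performance,
    beta weights the prior data's best vs. worst performance,
    gamma weights the player term against the prior-data term.
    """
    for name, w in (("alpha", alpha), ("beta", beta), ("gamma", gamma)):
        if not 0.0 <= w <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    player_term = alpha * c_best + (1.0 - alpha) * c_worst
    prior_term = beta * t_best + (1.0 - beta) * t_worst
    return gamma * player_term + (1.0 - gamma) * prior_term


# Clear times in seconds (illustrative numbers only, not experimental data).
goal = recommend_goal(c_best=60.0, c_worst=120.0, t_best=80.0, t_worst=100.0)
```

With all three coefficients at 0.5, the goal sits midway between the player’s own range and the range observed in prior data, so neither side dominates.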
Based on our previous studies’ results, GPR-DDA bridges the gap in audio cue volume settings between designer preferences and player performance, achieving this with a relatively fast response time. However, as a mechanism applied in an FPS game, we need to understand whether or not GPR-DDA can provide a positive overall gaming experience for players. Therefore, analyzing players’ experiences is important for evaluating our previously proposed method. In this paper, we follow the game design of our previous study, with a reasonable modification of the game map mentioned in Section 4.2. In addition, we set up two phases in the experiment, consistent with our previous study, to test whether or not GPR-DDA can provide a positive gaming experience.
4. Methodology
We divide our experiment into two distinct phases, each with its own unique group of players. In the first phase, players are randomly divided into two groups; one group plays multiple levels with audio cues of increasing volume, and the other group plays levels with audio cues of decreasing volume. In the second phase, by contrast, players play multiple levels, at each of which GPR-DDA sets the audio cue volume. Our decision to have two unique groups of players for the two phases is based on the following:
First, we want to prevent players from having psychological expectations before the game. Players with experience in the first phase may correctly guess the objective of our experiment. They could adopt a learned playing strategy if asked to participate in the second phase.
In addition, we need a set of prior data to compute the initial recommended level in the second phase. For this purpose, we use the players’ performance from the first phase as our prior data. As stated in the first reason, it would be inappropriate for our experiment to have the same players participate in both phases.
During each round, we record player data at one-second intervals, each consisting of their in-game coordinates, orientation, remaining health, and ammunition count. These data are essential for facilitating subsequent data processing and analysis, enabling us to replay and analyze each player’s trajectory.
Figure 1 provides an overall workflow of our experiment.
Findings from the field of sports medicine and exercise science [40] have suggested that an excessively prolonged experimental procedure might lead to player fatigue, potentially resulting in haphazard or arbitrary responses to a given questionnaire. Therefore, we limit the duration of our experiment to less than an hour per participant. One round consists of playing a game level and answering a questionnaire and takes about 8 to 10 min. Hence, it is reasonable for each player to participate in five rounds (about 40 to 50 min) and a final survey asking for general information (about 10 min).
4.1. Procedure
4.1.1. Phase One
In phase one, which serves as our baseline, players participated in an FPS game with a manually set difficulty curve. We established two distinct sets of manually implemented difficulty levels, with finer granularity than in our previous study [10]. In the first set, called Group A, the enemy volume progresses from low to high, ranging from 50% to 100%, across five levels, with equal increments: 0.5, 0.625, 0.75, 0.875, and 1. Conversely, the second set, called Group B, involves a reverse progression in which the enemy volume decreases: 0.5, 0.375, 0.25, 0.125, and 0. The primary rationale for having two groups is based on the following considerations:
Firstly, due to our restriction on players participating in only five rounds of gameplay, it is deemed impractical to have a linear progression from 0 to 100% across five levels with a uniform increment. Such a dramatic variation in enemy volume between two consecutive levels would likely hinder the understanding of our collected data.
Secondly, dividing into two groups allows us to examine player experiences when encountering scenarios where the enemy volume increases from low to high and the opposite. Although this is not the primary focus of our experiment, it has the potential to uncover discrepancies in player experiences under different volume trends. Therefore, this may yield additional insights, enhancing the reliability of the audio cue volume design.
During phase one, players will be randomly assigned to one of the groups and will not be informed of their group assignment in advance. Players are expected to discern changes in audio cue volume on their own. This approach is implemented to avoid creating any psychological expectations in players prior to gameplay, as such expectations could compromise the reliability of the experimental data.
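The two phase-one volume schedules can be generated as evenly spaced sequences. The following is a minimal sketch; the helper name `volume_schedule` is hypothetical and is not taken from the game’s source code.

```python
def volume_schedule(start, end, levels=5):
    """Return `levels` evenly spaced enemy-volume values from start to end."""
    step = (end - start) / (levels - 1)
    return [start + i * step for i in range(levels)]


group_a = volume_schedule(0.5, 1.0)  # enemy volume increases: 0.5 up to 1
group_b = volume_schedule(0.5, 0.0)  # enemy volume decreases: 0.5 down to 0

# For comparison: a single 0-to-1 progression over five levels would jump
# by 0.25 per level instead of 0.125.
single_group = volume_schedule(0.0, 1.0)
```

Splitting the range into two groups halves the step between consecutive levels, which matches the first rationale given above.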
4.1.2. Phase Two
In phase two, we use GPR-DDA (Algorithm 1) to control the in-game audio cue volume. In our previous research, the goal recommendation algorithm was introduced to derive the player’s target performance (Equation (1)) for the DDA mechanism, where all three coefficients are set to 0.5 [10]. In this work, we follow this approach.
GPR-DDA recommends to the player a level with an audio cue volume setting at which the player’s actual or predicted performance is closest to the goal [10]; however, if the player has already played such a level, a random level will be chosen [10]. This mechanism ensures that the player will not encounter the same level repeatedly and expedites the discovery of the globally optimal audio cue volume setting rather than a locally optimal one.
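The selection rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors’ implementation: `choose_next_level` and its parameters are hypothetical names, `predict` stands in for the trained GPR model, and the random fallback is assumed to draw only from levels not yet played.

```python
import random


def choose_next_level(levels, goal, played, predict, rng=random):
    """Pick the audio cue volume whose performance is closest to the goal.

    levels  : candidate volume settings
    goal    : target performance (e.g., clear time in seconds)
    played  : dict mapping already played settings to observed performance
    predict : callable giving predicted performance for an unseen setting
              (placeholder for the GPR model; hypothetical interface)
    """
    def performance(volume):
        # Use the actual result for seen levels, the prediction otherwise.
        return played[volume] if volume in played else predict(volume)

    best = min(levels, key=lambda v: abs(performance(v) - goal))
    if best in played:
        # Avoid repeating a level (assumption: explore an unplayed one).
        unplayed = [v for v in levels if v not in played]
        if unplayed:
            return rng.choice(unplayed)
    return best


levels = [0.0, 0.25, 0.5, 0.75, 1.0]
played = {0.5: 100.0}                       # illustrative clear time
toy_model = lambda v: 140.0 - 60.0 * v      # toy stand-in for GPR predictions
next_level = choose_next_level(levels, goal=90.0, played=played, predict=toy_model)
```

In this toy run the predicted clear time at volume 0.75 is 95 s, the closest to the 90 s goal, so that level is chosen; had the closest level already been played, a random unplayed level would be returned instead.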
4.2. FPS Games
Our experimental game is an FPS that requires players to eliminate enemies on the map within a specific ammunition limit or locate the escape point, all while having limited health. This game simulates a scenario in which the player must escape an area under heavy siege. Players can choose to either eliminate all enemies to complete the level or find the escape point, regardless of the number of enemies they have eliminated. In this game, players need to listen attentively to the enemies’ audio cues to determine their positions for evasion or elimination.
Figure 2 illustrates a typical gameplay scene in our game.
In this study, we retained the FPS game mechanisms and the opponent AI algorithm used in our previous work [10] (as detailed in Appendix A), while introducing modifications to the game’s map design. The original game featured five small maps, each measuring 500 units by 500 units in size, in Unreal Engine. These maps were characterized by their simplicity, featuring a single linear path for progression and lacking a multi-floor design. We consider such a map design ill-suited for contemporary FPS games. In our previous study [10], the small map size restricted player exploration and led to a lack of engagement, potentially impacting the accuracy of the player experience assessment.
To address these concerns, we developed a new game map with dimensions of 8000 units by 8000 units in Unreal Engine. This map is composed of two floors (Figure 3), offering a significant increase in size, being 512 times larger than each map in the previous work. Furthermore, to align the map design more closely with commercial FPS games, we incorporated principles and concepts reminiscent of those found in commercially available games like the FPS game released by Ubisoft in 2015 called Tom Clancy’s Rainbow Six Siege.
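The stated factor of 512 can be reproduced from the map dimensions, assuming it compares the combined area of both floors of the new map with the area of a single map from the previous work:

```latex
\[
\frac{2 \times 8000^{2}}{500^{2}}
  = \frac{2 \times 64{,}000{,}000}{250{,}000}
  = 2 \times 256
  = 512
\]
```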
It is essential to note that, in our previous work [10], those five maps were crafted to prevent players from easily memorizing map layouts, thereby safeguarding the accuracy of the experiment. We maintain this perspective here. Consequently, in our newly designed larger map we incorporated eight distinct and randomized spawn points. In each game round, players are randomly assigned to one of these spawn points. We are aware that random allocation might occasionally result in players respawning at the same point. However, this approach aligns closely with the logic found in commercial games and complements our map design philosophy, resembling key elements of commercially available games.
The next section discusses which in-game parameter best represents our players’ performance in our game. Our candidates are player life points, hit rate, and game clear time.
4.3. Metric for Player Performance
We implemented a common feature in modern commercial FPS games known as recoil, where the player’s crosshair moves upward after firing a shot. This feature requires players to control mouse movements to better aim at enemies while firing. We need to understand that novice and experienced players manage in-game recoil differently. This means that novice players have limited control over recoil, whereas experienced players can keep their crosshair within a smaller range while continuously firing. As a result, novice players may find it difficult to hit enemies, especially distant enemies, while experienced players may have a much higher hit rate. This implies that the hit rate is too uncontrollable to serve as an objective metric for assessing players’ performance. This is because improved game performance does not necessarily indicate a continuously increasing/declining value of hit rate.
In addition, based on the game logs collected from our previous study, we understand that players have varying play styles. Some players tend to seek out and engage enemies upon discovery, while others prefer to avoid combat and take alternative routes. We do not wish to restrict players’ play styles by using a fixed route, as this could make them feel that the game’s design limits their creativity; consequently, simply calculating a player’s health cannot objectively represent their performance. Also, improved game performance does not necessarily indicate a continuously increasing/declining value of a player’s health points.
We believe that only using clear time to represent player performance is more objective than using hit rate or player health. First, if experienced players can obtain more information from audio cues, they can eliminate enemies and complete the game more quickly. Second, we designed the game so that, upon death, players must restart from the beginning, and previously defeated enemies do not respawn. This means that players who enjoy combat will take longer to finish the game, while those who prefer to avoid enemies can find a safer path and complete the game more quickly by gathering more information from audio cues. Considering these aspects, only using clear time to assess a player’s performance individually will have a higher probability of showing a continuously decreasing value as game performance increases.
However, in this paper, we will not discuss the relationship between audio cue volume and player clear time again. This is because our previous study also used player clear time to assess performance, and this relationship has already been shown in that study [10]. The main purpose of this paper is to reveal whether or not, under GPR-DDA, players can have a positive gaming experience.
4.4. Participant Recruitment
In this paper, we target the predominant gaming demographic interested in FPS games. To ensure a sufficient number of respondents for both phases, we recruited students from two different classes at our university as the main participants in our experiment. Furthermore, to increase the diversity of the respondents, we also conducted phase one using Amazon Mechanical Turk (AMT). In addition, we randomly distributed the experiment to volunteer members of the general public in phase two, ensuring that these participants had not taken part in phase one on AMT. Participation in our experiment was voluntary, except for those from AMT who received a payment of 0.01 dollar, in line with the minimum payment policy of the AMT platform. All participants in the experiment gave their informed consent, as detailed in Section 5.1, beforehand. Additionally, all participants were different from those in our previous study.
4.5. Questionnaire
At the end of each round, our participants were required to complete a modified GUESS questionnaire in their phase. This questionnaire is crucial for capturing the gaming experience aspects of each round for every player, which allows us to construct individual experience curves over the course of five rounds of gameplay. Furthermore, it will enable us to compute the average gaming experience value within specific difficulty intervals, defined as audio cue volume ranges.
In Table 1, we present our GUESS questionnaire, which employs a seven-point Likert scale scoring system (1 = Strongly Disagree, 7 = Strongly Agree). It comprises 11 questions, including three questions for detecting whether or not the participants are providing fraudulent responses. These three questions incorporate two fraudulent respondent detection mechanisms and are not used in our gaming experience analysis.
The first question in the fraudulent respondent detection category in Table 1 uses a consistency check. This means that each participant must consistently answer this question in all five rounds of the game. Inconsistent responses to this question in the five questionnaires make all the responses provided by the participant unreliable, and their data will be discarded.
The second and third questions in the fraudulent respondent detection category in Table 1 use a logical inconsistency check. We obtain these two questions by negating two selected questions from the eight game experience questions. For each pair of original and negated questions, we consider their responses reliable when their sum ranges between six and ten and those with a sum value outside of this range unreliable. If a participant gives unreliable responses for at least one such pair in a given round, the participant’s response data for that round will be discarded.
It is worth noting that the order of the 11 questions in our GUESS questionnaire is randomized. This randomization aims to enhance the assessment of potential fraudulent responses from players. Following this data-cleaning process, we obtain valid GUESS questionnaire responses for subsequent statistical testing.
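The two detection mechanisms described above can be expressed as small predicates. The following is a minimal sketch; the function names and the data layout (lists of Likert answers) are assumptions made for illustration, not the format used in our database.

```python
def is_consistent_participant(answers_per_round):
    """Consistency check: the control question must receive the same
    answer in every round the participant played."""
    return len(set(answers_per_round)) <= 1


def round_is_reliable(pairs, low=6, high=10):
    """Logical inconsistency check for one round.

    pairs: (original, negated) answers on the 1-7 Likert scale.
    A pair is reliable when its sum lies between `low` and `high` inclusive.
    """
    return all(low <= original + negated <= high
               for original, negated in pairs)


consistent = is_consistent_participant([4, 4, 4, 4, 4])    # same answer each round
inconsistent = is_consistent_participant([4, 4, 5, 4, 4])  # one answer differs
reliable_round = round_is_reliable([(6, 2), (3, 5)])       # sums are 8 and 8
unreliable_round = round_is_reliable([(7, 7), (3, 5)])     # first sum is 14
```

On a seven-point scale, a perfectly mirrored pair sums to 8 (for example, 7 and 1, or 4 and 4), so the accepted range of six to ten tolerates a deviation of up to two points.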
Additionally, our participants are required to complete a final survey at the end. Each will receive a unique ID at the end of the game. This ID will be used in the final survey to identify fraudulent respondents who, in this context, have not completed the game but have completed the survey. One purpose of the final survey is to gather some general information from them, including their age range, gender, and proficiency in playing FPS games.
General information will facilitate our data analysis by allowing us to investigate potential biases related to age, gender, or skill in playing FPS games, ensuring that we can discern whether or not certain findings are influenced by these factors. Concurrently, we also request players to provide feedback on their experiences while playing our game and any suggestions through the final survey. This feedback is intended to reveal additional aspects that may not have been adequately captured by GUESS regarding issues arising from changes in the volume of the audio cues of the game enemies.
5. Experiment
5.1. Informed Consent and Game Settings
Before starting the experiment, all participants needed to confirm their informed consent and read the instructions. For environmental settings, we explicitly mentioned that all participants were required to use their own devices (mouse, keyboard, and headphones/headset), and the Windows operating system was mandatory, as the game cannot be operated on other operating systems. Furthermore, we understand that extremely high audio volumes can make participants feel uncomfortable. Therefore, after asking all players to initially set their system volume to 100%, we asked them to adjust it to a suitable level during a designed tutorial level. If players adjusted the system volume during the tutorial level, we asked them to inform us of their system volume settings in our final survey.
5.2. Participants and Their Responses
Players who agreed to participate in the experiment were given anonymous access to our experimental game. Their performance data were uploaded to our database as they played. At the end of each round, they were required to complete our GUESS questionnaire built into the game, and the answers were also uploaded to our database. Internet access was required to play the game.
Our experiment consisted of two phases. In both phases, the participants were entirely different. A total of 80 people participated in the experiment, with 40 in each phase. However, some players completed only part of the game, and some did not fill out the final survey. After counting, with 40 players in phase one, we can identify 17 players with 70 GUESS responses from Group A, 16 players with 72 GUESS responses from Group B, and seven players who only completed their first assigned level, with an enemy volume of 0.5, and our GUESS questionnaire at the end of the level. In phase two, all 40 players participated, and we collected 136 GUESS responses.
Regarding the final survey, we received 38 responses in phase one and 34 responses in phase two. In both phases, no player informed us that they adjusted the system volume during the tutorial level. In addition, the final survey revealed that our participants came from various backgrounds: 48 university students, 22 non-students, and two users from AMT. They came from six different countries: the United States, China, Italy, Japan, Turkey, and Venezuela (listed in alphabetical order). Regarding gender, 62.5% of participants were male (phase one: 31, phase two: 19), 23.75% were female (phase one: 6, phase two: 13), 1.25% identified as other (phase one: 0, phase two: 1), and 12.5% preferred not to say or were unknown (phase one: 1, phase two: 1). Regarding age, 91.7% of our players were aged 18 to 29 (phase one: 35, phase two: 31), 6.94% were aged 30 to 39 (phase one: 2, phase two: 3), and 1.39% were over 40 (phase one: 1, phase two: 0). Regarding FPS ability, we used a scale of 0 (unfamiliar), 1 (somewhat familiar), and 2 (very familiar) to represent their proficiency. The results showed that 18.1% were unfamiliar (phase one: 7, phase two: 6), 43.1% were somewhat familiar (phase one: 18, phase two: 13), and 38.9% were very familiar (phase one: 13, phase two: 15). Although we cannot accurately determine the maximum decibel output of our players’ devices, based on the players’ age groups and their familiarity with FPS games, we can assume that there is no significant difference in the perception of audio in FPS games between players in both phases.
Table 2 presents the responses and provides detailed insights into the distribution characteristics of the players in both phases. Based on the general information provided by the final survey, the p-value of the Shapiro–Wilk test (phase one: 5.201 × 10^−6; phase two: 8.693 × 10^−6) shows that the player distributions in both phases with respect to the FPS ability do not follow a normal distribution. As the respondents from the two phases involve different groups of people, we used the Mann–Whitney U test to assess the significance of the difference between the respondents from the two phases. The p-value of the Mann–Whitney U test is 0.67, i.e., not statistically significant. This result suggests that, despite consisting of different individuals, the two groups have similar abilities. Therefore, comparing the GUESS values between phase one and phase two is dependable.
After the data cleaning process, using the fraudulent respondent detection mechanisms for our GUESS questionnaire described in Section 4.5, we were left with valid data. In phase one, we received 100 out of 142 valid GUESS questionnaire responses, each a set of answers, from 30 players, and in phase two we received 111 out of 136 such responses from 29 players. For the final survey, we identified one duplicate response from phase one and five fraudulent respondents from phase two, which we removed. Therefore, the final survey had 66 valid respondents (37 from phase one and 29 from phase two). Our data are accessible on the supplementary page (https://github.com/Lxx007/Anonymous (accessed on 5 November 2024)).
5.3. Metrics for Questionnaire Data
To examine whether or not the difference in the GUESS values for the players in phase one and phase two is statistically significant, we first calculate the average GUESS value—with the lowest score of 8 points and the highest of 56 points—for each player over all the rounds played by the player. After that, we use the Shapiro–Wilk test to check for a normal distribution of the data. If the GUESS values of players in both phases follow a normal distribution, we use the unpaired t-test; otherwise, we use the Mann–Whitney U test. These tests are used to determine which phase offers a better gaming experience for our participants.
In addition, we examine whether or not the GUESS value of each phase differs significantly from the GUESS mid-value of 32. Depending on the results of the Shapiro–Wilk test for each phase’s GUESS value, we use either the one-sample t-test or the Wilcoxon signed-rank test. These tests are used to examine whether or not the gaming experience of players in each phase significantly surpasses the mid-value of 32.
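For the non-normal case, the comparison against the mid-value can be sketched with a hand-written one-sample Wilcoxon signed-rank test. This is a minimal sketch under stated assumptions, not the code used in this study: the function name is hypothetical, scores equal to the mid-value are dropped, and the two-sided p-value uses the normal approximation without tie or continuity correction, so a library routine such as `scipy.stats.wilcoxon` would be preferable for real analyses.

```python
import math


def wilcoxon_vs_midpoint(scores, midpoint=32.0):
    """Return (W_plus, W_minus, two-sided p) against a fixed mid-value."""
    diffs = [s - midpoint for s in scores if s != midpoint]
    n = len(diffs)
    if n == 0:
        raise ValueError("all scores equal the midpoint")
    abs_sorted = sorted(abs(d) for d in diffs)

    def rank(value):
        # Mean of the 1-based positions occupied by this absolute value.
        first = abs_sorted.index(value)
        count = abs_sorted.count(value)
        return first + (count + 1) / 2

    w_plus = sum(rank(abs(d)) for d in diffs if d > 0)
    w_minus = sum(rank(abs(d)) for d in diffs if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # normal approximation
    return w_plus, w_minus, p
```

A `W_plus` much larger than `W_minus` indicates that most scores lie above the mid-value of 32.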
5.4. Metrics for Open-Ended Player Responses
Our final survey responses are written in both English and Chinese because our participants come from six different countries and include non-native English speakers. We therefore tasked three human evaluators who are proficient in both English and Chinese with the analysis. Because qualified human evaluators are limited, we also ask an LLM, ChatGPT (GPT-4), to help with the classification task described below for open-ended player responses. In previous research [
36], ChatGPT was successfully employed to classify bullet comment types in games, which shows that it can be used for classification tasks. In addition, as a powerful LLM, ChatGPT can perform translation and reading comprehension, which supports the multilingual semantic analysis required by our classification task. Therefore, both ChatGPT and human evaluators are employed to classify all player responses from the final survey. The objective of this analysis is to uncover the reasons behind the objective results.
However, the same previous research also reported that both human evaluators and ChatGPT can give inconsistent answers for some complex sentences. To mitigate this issue, as well as the effect of hallucinations present in ChatGPT, we have designed the following procedure:
For ChatGPT, we use a prompt that we crafted for this classification task and run it three times for each classification instance. We consider a result reliable if and only if it is consistent across all three runs; otherwise, it is deemed unreliable and labeled as 0.
For our human evaluators, we compare their responses. For each classification instance, if they all give the same answer, we consider the answer reliable; otherwise, it is deemed unreliable and labeled as 0.
Afterwards, we combine the results of ChatGPT and the human evaluators as follows:
If the result from the human evaluators is labeled as 0 but that from ChatGPT is not, we select ChatGPT’s result.
If both results from ChatGPT and the human evaluators are labeled as 0, the result is categorized as 0.
Otherwise, we select the result from the human evaluators. In cases where both results from ChatGPT and the human evaluators are not labeled as 0, we prioritize the response of the human evaluators because we consider them to be more reliable than LLMs.
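The procedure above can be sketched in a few lines. The function and constant names are hypothetical, and the sketch assumes that each label is one of the strings "A", "B", or "C", with "0" marking an unreliable result as in the text.

```python
UNRELIABLE = "0"  # label used in the text for an unreliable result


def consensus(labels):
    """Return the shared label if all labels agree; otherwise UNRELIABLE."""
    return labels[0] if len(set(labels)) == 1 else UNRELIABLE


def combine(human_labels, chatgpt_labels):
    """Combine the human and ChatGPT results for one classification instance.

    human_labels:   labels from the human evaluators.
    chatgpt_labels: labels from the three ChatGPT runs.
    """
    human = consensus(human_labels)
    chatgpt = consensus(chatgpt_labels)
    if human != UNRELIABLE:
        return human   # a unanimous human result takes priority
    return chatgpt     # ChatGPT's result, or "0" if it is also unreliable
```

For example, `combine(["A", "B", "B"], ["C", "C", "C"])` falls back to ChatGPT's consistent label, whereas any unanimous human label is returned regardless of what ChatGPT produced.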
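To gain clarity on our players’ feelings about the audio cue design in each phase, we categorize all open-ended player responses from the final survey into three categories: (A) responses that do not mention the audio cues or do not clearly express a sentiment towards them, (B) responses that directly mention the audio cues and complain about them, and (C) responses that directly mention the audio cues and praise them.
Prompt 1 shows the prompt. It is used by ChatGPT (via OpenAI API) and our human evaluators to analyze the open-ended player responses, each of which is placed individually under @paragraphs. This ensures that both ChatGPT and the human evaluators perform this classification task based on the same instruction.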
Prompt 1: Prompt for analyzing open-ended responses from the participants
Please classify the following @paragraphs into three categories based on their content related to audio cues in the game. If the paragraph doesn’t clearly and directly express a positive or negative sentiment towards the game’s audio cues, please choose A. If the paragraph clearly and directly expresses a negative sentiment towards the game’s audio cues, please choose B. If the paragraph clearly and directly expresses a positive sentiment towards the game’s audio cues, please choose C. If the text doesn’t mention audio cues, it means there are no complaints or compliments about audio cues. Please avoid overinterpreting.
A: Not mentioned
B: The paragraph directly mentions audio cues in the game and complains about them
C: The paragraph directly mentions audio cues in the game and praises them
Please directly tell me the classification; there is no need to explain the reasons within.
@paragraphs
It was fun at first. However, FPS games sometimes make me sick. So it’s no longer comfortable to play.
6. Results and Discussions
6.1. Results and Discussions of Questionnaire Data
The average GUESS value for phase one, which consists of 100 responses, is 32.85. For phase two, which has 111 responses, the average GUESS value is 38.58. The average GUESS value in phase two is higher than that in phase one, indicating that, on average, the players in phase two prefer the level recommendation mechanism. In addition to the average value, we statistically examined which phase can provide players with a superior gaming experience.
Table 3 shows that the average GUESS values of the players in phase two follow a normal distribution, while those in phase one do not. Therefore, we used a Mann–Whitney U test to assess the significance of the difference in the GUESS results between the two phases. The
p-value of the Mann–Whitney U test shows a significant difference between phase one and phase two, with the latter outperforming the former.
Moreover, we compared each group’s GUESS value with the midpoint value of 32. The Wilcoxon signed-rank test reveals that the GUESS value for the players in phase one does not significantly differ from the midpoint value. However, the average GUESS value for the players in phase two significantly exceeds the midpoint value of 32, as shown by a one-sample t-test.
In summary, the average GUESS value in phase two was higher than that in phase one, indicating that the players in phase two had a better gaming experience. In addition, the phase-two value was significantly higher than the midpoint value of 32. Consequently, these results indicate that players can have a better gaming experience in an FPS game with GPR-DDA.
In
Figure 4, we present the trend in our GUESS values across the two phases. The dark green and dark blue lines represent the trend of the average GUESS values in the two phases. Dots on the dark blue line represent the selected audio cue volumes in phase one. The light-colored area indicates the region where all error bands are connected using smooth lines. The granularity of the audio cue volume in phase two was finer than in phase one, with a similar population of participants as in phase one, meaning that some levels might have been played by only one person. This led to the error band area in phase two not being continuous. In phase two, the maximum and minimum audio cue volumes provided by GPR-DDA were 82 and 3, respectively, resulting in a shorter data range than in phase one.
From
Figure 4, we observe that the overall trend of GUESS values in phase two is higher than in phase one, indicating that players found the game more enjoyable under the control of GPR-DDA in phase two. In addition, we found that the average GUESS values in the ranges of 20 to 30 and 45 to 70 in phase two were not only higher than those in phase one but also more stable, which may indicate that these two ranges are the most reasonable settings for our players. Moreover, there is a noticeable decline when the audio cue volume exceeds 70, with a sharp peak around 80, which can be considered an outlier. We speculate that an audio cue volume above 70 causes significant discomfort for players, leading them to consider the game as less enjoyable.
Our results support the idea that GPR-DDA can provide a better gaming experience than linearly changing the enemy audio cue volume. These results answer our RQ 1 (Can players have a positive gaming experience when a DDA mechanism is implemented through game volume?) in the affirmative. They also suggest that, when setting the goals of a DDA mechanism, balancing designer preference and player performance is preferable. As in our previous research, three coefficients allow designers to adjust the weights that the DDA places on designer preference and player performance to fit their games. Our results support that players will have a positive gaming experience when all three coefficients are set to 0.5. In addition, we posit that it will be difficult to generalize the positive gaming experience to all players if the coefficients are set to extremes, e.g., if a coefficient is set to 0 so that only designer preference is considered. In summary, setting all the coefficients to 0.5 is reasonable, but extreme weighting is not advisable. We highly recommend that the coefficients be adjusted accordingly for each game project.
In our procedure outlined in
Section 5.4, a given instance is labeled as A, B, or C when the human evaluators unanimously select that label or, failing that, when all rounds of ChatGPT consistently select it. This procedure ensures that only player feedback with such a label is considered in the following analysis (
Table 4), increasing the reliability of the findings. The consistency among the responses from our three human evaluators is 81.25%, while that among the three rounds of responses from ChatGPT is 90.28%. Furthermore, ChatGPT’s responses have shown a high similarity of 75% to those of its human counterparts. These results demonstrate the potential of ChatGPT in providing consistent and human-like answers. As a result, we recommend applying the procedure in
Section 5.4 when using LLMs such as ChatGPT to assist with classification tasks when human evaluators fail to reach a consensus.
Table 4 shows the results of our analysis, revealing interesting trends. During phase one, 9 out of 37 final survey respondents reported complaints, accounting for 24.32% (e.g.,
“The absence of enemy movement sound effects makes it easy for players to be unaware of their positions and get killed.”). In phase two, 7 out of 29 final survey respondents complained, accounting for 24.14%, slightly less than in phase one (e.g.,
“I am highly sensitive to loud noises and as instructions stated I played the game with headphones at max volume. It felt really claustrophobic.”). However, during phase two, one player explicitly praised the in-game audio cues, which did not occur during phase one (e.g.,
“Besides, the sound of the footsteps is a key mechanic in the game, it’s cool.”). The results indicate that, in phase two, players had fewer complaints about the enemy audio cue design and expressed more positive feedback. We argue that this is a significant factor in the overall improved player experience in phase two compared to phase one, according to the results from the modified GUESS questionnaire.
6.2. Results and Discussions of Open-Ended Player Responses
To answer our RQ 2 (How does the game volume affect the player’s gaming experience?) in detail, we conducted a manual review of the primary reasons for player complaints. Out of the nine complaining players in phase one, two found the audio cue volume too low (e.g., Players should be able to hear enemy footsteps when they are nearby.), four found it too high (e.g., Sound effects are crucial in FPS games. After playing, my ears feel overwhelmed.), and three thought that the sound effects needed improvement. In contrast, out of the seven complaining players in phase two, five requested more realistic sound effects, particularly for footsteps (e.g., It is recommended to modify the sound of the robot’s movement.), which is not the main focus of this study, while only two felt that the audio cue volume was too high (e.g., The volume is too loud, and it’s uncomfortable because I don’t know how to adjust the mouse sensitivity.).
As a result, we argue that having more varied complaints in a DDA setting tends to indicate that the DDA setting provides a worse gaming experience. Regarding feedback about unrealistic sound, only one player in phase one specifically mentioned poor footstep simulation, which limits further discussion. However, complaints about the volume being too loud were very prominent, indicating that players generally dislike games with high volume levels. In addition, from the GUESS trend in phase two, we observed that once the volume exceeds 70, players’ GUESS values drop sharply. Based on this trend and the players’ feedback, and given that we do not know the exact dB values experienced by the players, we recommend keeping the game volume below 70 in the absence of such information.
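The notion of complaint variety can be made concrete with a small sketch that counts the distinct complaint types per phase. The counts are those reported in the manual review above; the category names are paraphrases chosen for illustration, and the function name is hypothetical.

```python
from collections import Counter

# Complaint counts from the manual review; category names are paraphrased.
phase_one = Counter({
    "volume too low": 2,
    "volume too high": 4,
    "sound effects need improvement": 3,
})
phase_two = Counter({
    "more realistic sound effects requested": 5,
    "volume too high": 2,
})


def complaint_variety(counts):
    """Number of distinct complaint types raised by at least one player."""
    return sum(1 for n in counts.values() if n > 0)


print(complaint_variety(phase_one), complaint_variety(phase_two))
```

Under this simple count, phase one shows more distinct complaint types than phase two, which is consistent with the argument above; a more refined metric could also weight each type by how many players raised it.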
7. Conclusions
This paper explored whether or not GPR-DDA can enhance the gaming experience of players and how audio cue volume affects it. A comparison of the GUESS values of the players in phases one and two revealed that GPR-DDA, used in phase two, significantly improves the player experience. Moreover, based on player feedback from the final survey, we identified why GPR-DDA is superior to the baseline used in phase one. Based on these findings, we draw the following conclusions.
Firstly, because the relationship between gaming experience and audio cue volume is highly personalized, it is challenging to manually adjust the game audio cue volume for each player to achieve not only optimal performance but also a better gaming experience. However, our results suggest that choosing a lower volume of audio cues might provide a better experience. More importantly, GPR-DDA can improve players’ individual gaming experiences in an FPS game.
Secondly, we extend the findings from our previous work [
10]. Our previous study suggested that a moderate audio cue volume (around 0.5) leads to higher player performance. In the current study, our findings indicate that players prefer a lower audio cue volume, which results in a better gaming experience. Therefore, we infer that using audio cue volume to control game difficulty does not necessarily result in a better gaming experience for each individual, even if it improves their performance. As a result, for an FPS game, we recommend initializing the audio cue volume at a low value (less than 0.5) for players who prioritize gaming experience and around 0.5 for players who prioritize performance. In addition, the coefficients in Equation (
1) should be adjusted accordingly for GPR-DDA to improve both player performance and gaming experience.
Thirdly, our analysis of the players’ feedback from the final survey suggests that the variety of complaints can serve as a metric to evaluate player experience. Additionally, the proposed procedure in
Section 5.4 allows language models such as ChatGPT to reliably assist humans in performing classification tasks for natural language analysis. Finally, since the game used in this study aligns its design philosophy and game mechanisms with those of commercial 3D FPS games, our study has potential applications for such games on the market.
8. Limitations and Future Work
FPS games demand that players focus more on audio cues, which provide strategic information for gameplay rather than merely serving as ambiance or decoration. Our findings may therefore not generalize beyond the FPS genre, for example to role-playing, racing, or fighting games, where the significance of audio cues might be less pronounced. In addition, our game is a 3D game, and the perception of audio in 3D and 2D games may be different. As a result, more detailed studies are required for other game genres and 2D games.
Furthermore, this study focuses solely on controlling the volume of audio cues, and detailed information such as decibels, sound pressure levels, and loudness is not provided here. We will measure and report this information to enhance the audio design. Additionally, we will explore player perception of audio loudness in more detail in the future, considering adaptation, order, and summation during auditory stimuli. This approach will provide designers with precise statistical data to inform their audio design in games. Our future research will also extend this study to various game genres and 2D games to continue exploring the significance of audio cues. This should contribute to the advancement of audio cue studies, whether aimed at enhancing gaming enjoyment, educational significance, or aiding visually impaired individuals.