An Evolutionary Framework for Connect-4 as Test-Bed for Comparison of Advanced Minimax, Q-Learning and MCTS

Henry Taylor and Leonardo Stella

H. Taylor and L. Stella are with the School of Computer Science, College of Engineering and Physical Sciences, University of Birmingham, Birmingham, B15 2TT, U.K. (email: [email protected], [email protected]).
Abstract

A major challenge in decision-making domains with large state spaces is to effectively select actions which maximize utility. In recent years, approaches such as reinforcement learning (RL) and search algorithms have been successful in tackling this issue, despite their differences. RL defines a learning framework that an agent explores and interacts with. Search algorithms provide a formalism to search for a solution. However, it is often difficult to evaluate the performance of such approaches in a practical way. Motivated by this problem, we focus on one game domain, i.e., Connect-4, and develop a novel evolutionary framework to evaluate three classes of algorithms: RL, Minimax and Monte Carlo tree search (MCTS). The contribution of this paper is threefold: i) we implement advanced versions of these algorithms and provide a systematic comparison with their standard counterparts, ii) we develop a novel evaluation framework, which we call the Evolutionary Tournament, and iii) we conduct an extensive evaluation of the relative performance of each algorithm to compare our findings. We evaluate different metrics and show that MCTS achieves the best results in terms of win percentage, whereas Minimax and Q-Learning are ranked in second and third place, respectively, although the latter is shown to be the fastest to make a decision.

I INTRODUCTION

Recent successes of game artificial intelligence (AI) have sparked increasing interest in the research community. Examples include Deep Blue, which beat the reigning World Chess Champion, Grandmaster Garry Kasparov, in 1997, and DeepMind's AlphaGo [1], which beat European Go Champion Fan Hui and Go Master Lee Sedol in a five-game match. These achievements have led to a focus on General Game Playing (GGP) algorithms, where a single agent can learn multiple games to a super-human level. The problem that arises when comparing these different algorithms is the lack of consistency in methodologies.

The aim of this paper is to provide a thorough analysis of a range of algorithms including Q-learning, Minimax and Monte Carlo Tree Search (MCTS), as well as their more advanced counterparts. To address the issue of inconsistency in the analysis, we carry out the comparison of these approaches across a single game domain, i.e., Connect-4. Connect-4 is a two-player zero-sum perfect information game played on a vertical board of size $6\times 7$, which has over 4.5 trillion game states. The outcome of the game for either player can be win, lose or draw; these conditions are referred to as terminal states, and a win is awarded if four pieces are connected horizontally, vertically or diagonally [kang2019, 3]. Connect-4 was chosen as a test-bed for comparison in line with the large body of work that constitutes the literature review of this paper, which follows.
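The win condition described above can be sketched in a few lines of Python. This is only an illustration, not the implementation used in this paper: the board encoding (a list of rows with 0 for an empty cell) and the function name `has_four` are assumptions made for the example.

```python
ROWS, COLS = 6, 7  # board size stated in the text

def has_four(board, player):
    """Return True if `player` has four connected pieces on `board`.

    `board` is a list of ROWS lists of COLS cells; 0 means empty
    (this encoding is an assumption made for the sketch).
    """
    # Directions: right, down, down-right, down-left.
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in directions:
                cells = [(r + k * dr, c + k * dc) for k in range(4)]
                # Bounds are checked before indexing, so no IndexError occurs.
                if all(0 <= rr < ROWS and 0 <= cc < COLS
                       and board[rr][cc] == player
                       for rr, cc in cells):
                    return True
    return False

board = [[0] * COLS for _ in range(ROWS)]
for col in range(4):
    board[5][col] = 1      # four pieces along the bottom row
```

Scanning every cell in four directions is not the fastest approach, but it keeps the rule (horizontal, vertical, both diagonals) easy to read.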

Connect-4 was first independently solved by Allen and Allis in 1988 via two different approaches [4]. Allen used a brute-force depth-first search, while Allis built a program called VICTOR, which used knowledge-based strategic rules based on the Chess concept of Zugzwang [4]. More recently, the authors in [5] applied Q-Learning with an Epsilon-Greedy policy to small board sizes for Connect-4 and other games. In [kang2019], different Minimax payoff functions in Connect-4 were investigated, with the finding that complex heuristics are advantageous at deeper Minimax searches. In [6], Minimax with Alpha-Beta pruning cuts was found to be an improvement over the base Minimax algorithm for Connect-4. The authors in [Scheiermann_2022] applied a Reinforcement Learning (RL) agent with an MCTS wrapper to several games including Connect-4. They showed that a base MCTS agent (with random play-outs and 10,000 iterations) had a win rate of $\approx 100\%$ against the RL agent, $\approx 25\%$ against a near-perfect Minimax Alpha-Beta agent, and $\approx 1\%$ against the RL agent wrapped with MCTS (inspired by AlphaZero) for Connect-4. Therefore, the AlphaZero-inspired approach performs the best, followed by a Minimax agent with Alpha-Beta cuts, then MCTS, and lastly, the classical RL agent.

Finally, in [8] it was shown that the best Connect-4 algorithm was MCTS, followed by Deep-Double Q-Learning, then Minimax with Alpha-Beta pruning cuts. This was a surprising conclusion given that correctly configured Minimax algorithms play optimally [6, kang2019, 9]. Furthermore, the Minimax payoff function used in [8] is less sophisticated than those in the existing literature (see [6, kang2019]), and the comparison between the algorithms was based on limited analysis, which suggests that the conclusions may be incorrect.

It is hard to compare results from [8] and [Scheiermann_2022] due to either the differences in algorithm formulation (for example, the payoff function in Minimax), or the differences in methodology (an algorithm versus algorithm approach in [8] as opposed to a base control algorithm as a comparison in [Scheiermann_2022]). It is likely that the key to the variation in results lies in these differences. Another critical aspect was the limited evaluation. What if an algorithm plays perfectly but takes a long time to make a move (see [10])? What if an algorithm plays strongly against another but cannot account for the variability of moves in a random agent (see [11])?

Motivated by these shortcomings, the contribution of this paper is threefold. First, we implement advanced versions of the algorithms under review from the literature and compare them in the case of Connect-4, enabling a comparison that is consistent in methodology, branching factor and rules. Second, inspired by Axelrod’s tournament [12] for the Prisoner’s Dilemma game, we introduce a novel methodology for comparative research, the Evolutionary Tournament. Third, we provide additional evaluation of each algorithm through a selection of classical evaluation approaches.

This paper is organized as follows. In Section II, we introduce each approach, its advanced counterpart and the evaluation methods. In Section III, we present results and discuss the implications of each of the methods on the comparison. Finally, in Section IV, we draw conclusions and present our future research directions.

II ALGORITHMS

In this section, we introduce the design and implementation of the core algorithms used as candidates for the comparative analysis in Connect-4.

Q-Learning. Q-Learning is a temporal-difference (TD) algorithm where a Markov Decision Process (MDP), a framework for sequential decision problems [13], is considered. An MDP is defined as $MDP(\mathcal{S},\mathcal{A},P(s',s,a),R(s',s,a))$, where $\mathcal{S}$ is the set of possible states, $\mathcal{A}$ is the set of possible actions, $P(s',s,a)=P(s_{t+1}=s'|s_{t}=s,a_{t}=a)$ is the transition model that maps a new state $s'$ from state $s$ through action $a$ via one-step state transition dynamics, and $R(s',s,a)=\mathbf{E}[r_{t+1}|s_{t}=s,a_{t}=a,s_{t+1}=s']$ describes the reward function associated with the transition model obtained by the environment.

The goal in an MDP is to find a policy $\pi$ that maximises the expected reward over a given time horizon. This is described by equation (1), which is referred to as the Bellman equation:

$$V^{\pi}(s)=\sum_{a}\pi(s,a)\sum_{s'}P(s'|s,a)\left[R(s',s,a)+\gamma V^{\pi}(s')\right], \tag{1}$$

where $V^{\pi}(s)$ is the value function that represents the expected utility under policy $\pi$ when starting in state $s$ and $\gamma\in[0,1]$ is the discount factor, which ensures that the sum is bounded in an infinite horizon.
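To make equation (1) concrete, the following sketch applies it repeatedly as an update rule on a tiny, made-up MDP. The two-state example, the dictionary layout for $P$ and $R$, and the function name `evaluate_policy` are all assumptions for illustration and are unrelated to the Connect-4 agents.

```python
def evaluate_policy(states, actions, policy, P, R, gamma, sweeps=200):
    """Iterate equation (1) as an update rule until V stops changing much."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # The right-hand side reads the old V; the new dict replaces it after.
        V = {
            s: sum(
                policy[s][a] * sum(
                    P[(s, a)][s2] * (R[(s2, s, a)] + gamma * V[s2])
                    for s2 in P[(s, a)]
                )
                for a in actions
            )
            for s in states
        }
    return V

# Hypothetical two-state MDP: "stay" keeps the state, "move" switches it.
states = ["A", "B"]
actions = ["stay", "move"]
P = {("A", "stay"): {"A": 1.0}, ("A", "move"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "move"): {"A": 1.0}}
# Reward 1 for arriving in B, 0 otherwise; keys follow R(s', s, a).
R = {(s2, s, a): (1.0 if s2 == "B" else 0.0)
     for (s, a), nxt in P.items() for s2 in nxt}
policy = {s: {"stay": 0.5, "move": 0.5} for s in states}  # uniform random
V = evaluate_policy(states, actions, policy, P, R, gamma=0.5)
```

Under the uniform policy the agent lands in B half of the time, so the expected reward per step is 0.5 and the discounted sum with $\gamma=0.5$ is 1 for both states.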

Solving Equation 1 gives the optimal policy (mapping between the states and actions) [9]. This is achieved by using the observed transitions to adjust the Q-Function values over time using the TD update equation:

$$Q(s,a)\leftarrow Q(s,a)+\alpha\left[R(s',s,a)+\gamma\max_{a'}Q(s',a')-Q(s,a)\right], \tag{2}$$

where $\alpha\in[0,1]$ is the learning rate. In Q-Learning these values are recorded into a Q-Table for each state-action pair, $(s,a)$. A single Q-Table is used for our Q-Learning agents regardless of the position they play. During training, each player does not have access to the other player's Q-Table values, as player I/II indexes the Q-Table based on states with an even/odd number of moves. If a state-action pair is not visited, then the respective utility value is not stored in the Q-Table and is instead represented as 0 or a missing index. Therefore, it is vital that training balances exploration and exploitation of the state space, which an Epsilon-Greedy training policy promotes. Actions are selected at random with probability $\epsilon$, a parameter, or selected according to a greedy policy [14], defined as $\arg\max_{a}Q(s,a)$ on the Q-Table [9], with probability $1-\epsilon$. This policy has been shown to be effective at training Q-Learning agents within Connect-4 contexts [5]. The Q-learning algorithm with epsilon-greedy policy is provided in Algorithm 1.

Algorithm 1 Q-Learning with Epsilon-Greedy
Input: $\epsilon$, State $s$, Q-Table $Q(s,action)$, $\alpha$, $\gamma$
Output: $action$, updated Q-Table $Q'(s,action)$
1: If $\epsilon >$ randomValue:
2:     $action$ = randomMove
3: If $\epsilon \leq$ randomValue:
4:     $action$ = $\arg\max_{a}Q(s,action)$
5: $s'$ = Function UpdateState($s$, $action$)
6: $R$ = Function Reward($s'$, $s$, $action$)
7: $FR$ = Function FutureReward($s'$, $action$):
8:     $action'$ = $\arg\max_{a}Q(s',action)$
9:     $s'$ = Function UpdateState($s'$, $action'$)
10:     return $MAX(Q(s',action'))$
11: $Q'(s,action)$ = Function UpdateQ($\alpha$, $\gamma$, $R$, $FR$)
12: return $Q'(s,action)$, $action$
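The two core steps of Algorithm 1, epsilon-greedy selection and the TD update of equation (2), can be sketched in Python as below. This is a minimal illustration rather than the code used for the experiments: the function names, the `(state, action)` dictionary keys and the string state labels are assumptions, and the handling of the opponent's reply between $s$ and $s'$ is left out.

```python
import random
from collections import defaultdict

def select_action(Q, state, legal_actions, epsilon, rng=random):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q[(state, a)])

def td_update(Q, state, action, reward, next_state, next_actions, alpha, gamma):
    """Apply the TD update of equation (2) to one (state, action) entry."""
    # Terminal next states have no actions, so their future value is 0.
    future = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
    return Q[(state, action)]

# Unvisited pairs read as 0, matching the "missing index" convention in the text.
Q = defaultdict(float)
Q[("s1", 3)] = 2.0
new_value = td_update(Q, "s0", 3, reward=4.0, next_state="s1",
                      next_actions=[2, 3, 4], alpha=0.5, gamma=0.5)
```

Here the update evaluates to $0 + 0.5\,(4 + 0.5\cdot 2 - 0) = 2.5$. A `defaultdict` is one convenient way to get the "0 for unseen pairs" behaviour; a plain dictionary with `.get(key, 0.0)` would work equally well.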

Q-Learning has three hyperparameters, namely, a discount rate denoted by $\gamma$, a learning rate denoted by $\alpha$, and the exploration rate $\epsilon$. We set $\gamma=0.5$ after testing a range of values against a random agent. Parameters $\epsilon$ and $\alpha$ are time-varying parameters that decay as a function of training iterations. This was shown to be effective in training Q-Learning agents in [5] for epsilon decay and [9, 11, 15] for learning rate decay. The reward function, $R(s',s,a)$, was designed in accordance with the Connect-4 literature [6, 8, kang2019]: Player win/lose (+40/-30), 1 piece in a row (+1), 2 pieces in a row (+2), 3 pieces in a row (+6), Column 2 and 6 placement (+1), Column 3 and 5 placement (+2), and Column 4 placement (+4). In the case where more than one condition is met, the values are summed.
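As a rough illustration of how such a summed reward could be assembled, the sketch below combines the win/lose and column-placement terms listed above. It is deliberately partial: the in-a-row terms are omitted, the zero bonus for columns 1 and 7 is an assumption (the text does not list them), and the names `COLUMN_BONUS` and `shaped_reward` are hypothetical.

```python
# Column bonuses from the text, keyed by 1-based column number.
# Columns 1 and 7 are not listed in the text; 0 is assumed here.
COLUMN_BONUS = {1: 0, 2: 1, 3: 2, 4: 4, 5: 2, 6: 1, 7: 0}
WIN_REWARD, LOSS_REWARD = 40, -30

def shaped_reward(column, outcome=None):
    """Sum the terms that apply to a move.

    `outcome` is "win", "loss", or None for a non-terminal move.
    The in-a-row terms (+1, +2, +6) are left out of this sketch because
    they need a board scan whose exact form the text does not spell out.
    """
    reward = COLUMN_BONUS[column]
    if outcome == "win":
        reward += WIN_REWARD
    elif outcome == "loss":
        reward += LOSS_REWARD
    return reward
```

For example, a winning move in the centre column would score $4 + 40 = 44$ under these assumptions.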

We are now ready to introduce the first part of the first contribution of the paper, the comparison of base algorithms with enhancements. This paper tested whether the presence of an expert player during training improved the Q-Learning agent's performance. As far as we are aware, this is the first investigation of this type of enhancement applied to Connect-4. A Q-Learning agent trained with half of its games played against a Minimax agent was compared to another Q-Learning agent which played all of its games against itself. Minimax was selected as an expert player because it plays optimally if correctly configured [9]. These two Q-Learning agents then played 100 games, with the Q-Learning (Minimax) agent winning 61% of its games. These results contradict those of [15], who found that a Q-Learning agent was stronger when trained against a random player than against more experienced adversaries. A possible explanation for this is offered by [11], which showed that surprising moves make a more challenging adversary. Therefore, the involvement of an expert player coaches the agent to respond to a better set of moves, making the states experienced more varied and preparing the agent better. The Q-Learning (Minimax) agent was then re-trained for 100,000 games as player I and 100,000 games as player II, totalling 200,000 games of experience, with the first 2000 games against a Minimax opponent and the rest through self-play. This constituted 1,293,018 unique Connect-4 states experienced across both player positions. Once the agent was trained, a Greedy policy was used for agent inference, as this has been shown to be the optimal policy [14].

Minimax. In a two-player game, the Minimax value, determined by equation (3), is the smallest payoff that the other player can guarantee the player will receive. The maximin value, determined by equation (4), is the largest payoff which the player can guarantee without knowing the other player’s actions [16]. These are defined as:

$$\overline{v}=\min_{a_{II}\in A_{II}}\max_{a_{I}\in A_{I}}u(a_{I},a_{II}), \tag{3}$$

$$\underline{v}=\max_{a_{I}\in A_{I}}\min_{a_{II}\in A_{II}}u(a_{I},a_{II}), \tag{4}$$

where $A_{I},A_{II}$ are the sets of strategies for player I and player II, respectively, and $u(\cdot)$ is the payoff function, representing the payment that player II makes to player I. For zero-sum two-player games such as Connect-4, maximising a player's payoff is the same as minimising the opponent’s payoff. Consider $U_{I}+U_{II}=0$, where $U_{I}$ and $U_{II}$ are the players' utilities, which sum to zero: one player has positive utility (a win) while the other has negative utility (a loss). Replacing each player's individual utility with a shared utility, $U$, representing a payment player II makes to player I, we get $U_{I}=U$ and $U_{II}=-U$ [16]. Therefore, $u(\cdot)$ captures the payoff of arriving at a state from player I’s perspective. Equations (3) and (4) are equivalent to the Nash equilibrium of the game [16].
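For a small matrix game, the two values can be computed directly from the verbal definitions above: the largest payoff player I can guarantee, and the smallest payoff player II can hold player I to. The sketch below is illustrative only; the payoff matrices are made up and the function names are assumptions.

```python
def maximin_value(u):
    """Largest payoff player I can guarantee: max over rows of the row minimum."""
    return max(min(row) for row in u)

def minimax_value(u):
    """Smallest payoff player II can hold player I to:
    min over columns of the column maximum."""
    columns = list(zip(*u))
    return min(max(col) for col in columns)

# Hypothetical payoff matrices u[a_I][a_II], payoffs to player I.
no_saddle = [[3, 1],
             [0, 2]]
saddle = [[2, 3],
          [1, 4]]
```

In `no_saddle` the two values differ (1 and 2), so neither player has a pure strategy that is safe against every reply. In `saddle` both values equal 2, which is the situation that arises in sequential perfect-information games such as Connect-4.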

Minimax search builds on these concepts by applying a recursive algorithm for selecting the best move in a 2-player (or n-player) game. This materialises as a plan from the perspective of two players (for Connect-4), which is captured in a sequential game tree structure detailing all possible moves from both players' perspectives. Each layer in the tree belongs to either player I or player II. Due to the equations $U_{I}=U$ and $U_{II}=-U$, player I tries to maximise $u(\cdot)$ while player II tries to minimise $u(\cdot)$ [16]. In other words, the algorithm attempts to find the optimal move for each player given the assumption that the other player is playing optimally.

A depth-first search is performed on the game tree, beginning by considering the available actions at $S_{0}$ for the player. For example, if column 7 is full, then the possible actions range from placing a piece in columns 1-6 from $S_{0}$. The algorithm chooses an action to get to a second state $S_{1}$ and then considers the list of available actions for the other player. Once an action has been selected, a new state is reached, $S_{2}$, where the algorithm once again considers the available actions for the player. This sequential process continues until a terminal state (win/draw/loss) or the search depth, set by $d$, is reached [9]. For example, Minimax will search to 2-ply if $d=2$, meaning that in a sequential two-player game the algorithm searches all game states where player I makes a move followed by player II making a return move.

Once the sequential process ends, $u(\cdot)$ is called to evaluate the payoff in that game state with respect to player I. This payoff is fed backwards up the tree to the root node of the game tree, $S_{0}$. The value stored at each node depends on the type of node the payoff is fed to: a player I node favours higher payoffs, applying equation (4), while a player II node favours lower payoffs, applying equation (3). This is because each player chooses the action ($a_{I}$ or $a_{II}$) that rewards them, so player I/II would choose actions which maximise/minimise $u(\cdot)$. We backtrack to $S_{0}$, at which point the whole algorithm is repeated from the second of the original six actions, and again until all actions are explored. Once the game tree is complete to a depth equal to $d$, the algorithm picks the action, $a_{I}$, which satisfies equation (4) and represents maximised $u(\cdot)$ [9]. The Minimax algorithm is provided in Algorithm 2.

Algorithm 2 Minimax
Input: State $s$, depth $d$, $MaxBool$
Output: $action$
1: If Function IsTerminal($s$) or $d$ == 0:
2:     return (None, Function ScoreState($s$))
3: ElseIf MaxBool == True:
4:     $value$ = $-\infty$
5:     For $action$ in availableActions:
6:         $s'$ = Function UpdateState($s$, $action$)
7:         (_, $newScore$) = Function Minimax($s'$, $d-1$, False)
8:         If $newScore$ $>$ $value$:
9:             $value$, $move$ = $newScore$, $action$
10:     return ($move$, $value$)
11: ElseIf MaxBool == False:
12:     $value$ = $\infty$
13:     For $action$ in availableActions:
14:         $s'$ = Function UpdateState($s$, $action$)
15:         (_, $newScore$) = Function Minimax($s'$, $d-1$, True)
16:         If $newScore$ $<$ $value$:
17:             $value$, $move$ = $newScore$, $action$
18:     return ($move$, $value$)
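A runnable counterpart to the Minimax procedure is sketched below on a small explicit game tree instead of a Connect-4 board, so that the result can be checked by hand. The nested-list tree encoding, the `evaluate` callback for depth-limited nodes and the example payoffs are assumptions made for the illustration.

```python
import math

def minimax(node, depth, maximizing, evaluate):
    """Return (best_move_index, value) for `node`.

    A node is either a number (terminal payoff to player I) or a list of
    child nodes; this tree encoding is an assumption of the sketch.
    `evaluate` scores non-terminal nodes when the depth limit is hit.
    """
    if not isinstance(node, list):
        return None, node                      # terminal state
    if depth == 0:
        return None, evaluate(node)            # depth limit reached
    best_move = None
    best_value = -math.inf if maximizing else math.inf
    for move, child in enumerate(node):
        _, value = minimax(child, depth - 1, not maximizing, evaluate)
        if (maximizing and value > best_value) or \
           (not maximizing and value < best_value):
            best_move, best_value = move, value
    return best_move, best_value

tree = [[3, 5], [2, 9]]   # player I moves first, then player II
move, value = minimax(tree, depth=2, maximizing=True, evaluate=lambda n: 0)
```

Player II would answer the first move with 3 and the second with 2, so player I prefers the first move (index 0) with value 3. With `depth=1` the search stops before player II's reply and falls back on `evaluate`, which is where a heuristic payoff function would be used.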

Minimax has one parameter, $d$. Increasing $d$ improves playing strength but also increases run time. If depth is set to the maximum number of Connect-4 game moves ($d=42$), it would play a perfect game but would suffer significant increases in search time. A trade-off between time and optimality therefore exists [6], which we will evaluate as part of Section III. The payoff function was designed according to Connect-4 literature [6, 8, kang2019]: win ($+\infty$/$-\infty$), 1 in a row (+1/-1), 2 in a row (+2/-2), 3 in a row (+6/-6), column 2 and 6 placement (+1/-1), column 3 and 5 placement (+2/-2), column 4 placement (+4/-4), where values are formatted as (Max/Min). In the case where more than one is activated, these values are summed.

Continuing with our first contribution, Minimax was enhanced and compared to a variety of modifications beyond the base algorithm. Alpha-Beta Pruning has been shown to be a valuable modification for search efficiency [6] because large parts of the tree which do not affect the outcome are cut (i.e., pruned) by using acquired knowledge of explored sub-trees. This is moderated by two dynamic parameters, $\alpha$ and $\beta$, which are initially set to $\alpha=-\infty$ and $\beta=\infty$. As the search progresses, $\alpha$ and $\beta$ are updated to store the best values found so far for player I/II in that particular sub-tree. If a branch cannot improve on these bounds, i.e., $\alpha\geq\beta$, then the search of that branch stops [9]. Move ordering is another enhancement added to prioritize stronger moves earlier rather than delay a win. The success of algorithms such as Alpha-Beta depends on the order in which nodes corresponding to actions are visited [9] due to the depth-first search. If the algorithm finds two win conditions, then both actions would be assigned large positive values and the first move visited would take priority over the second. Move Ordering aims to improve this by adding a small breadth search to the start of each node, which provides an optimal sub-tree search order based on the next set of actions. This allows Minimax to perform more efficiently and search up to double the depth [9]. Each combination of Minimax algorithms, namely Minimax, Alpha-Beta (AB), Move Ordering (MO), and Alpha-Beta with Move Ordering (ABMO), played each other once in each player position (due to the deterministic nature of each) at Connect-4. Every game resulted in a draw. Minimax-ABMO was selected as the strongest Minimax variation due to evidence of higher efficiency [6, 9] added by Alpha-Beta and Move Ordering. As far as we are aware, this is the first comparison between these enhancements and the base algorithm in the domain of Connect-4.
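The effect of pruning, and why the visiting order matters, can be seen on a toy tree. The sketch below is an illustration under assumptions (nested-list trees, a `stats` dictionary that counts evaluated leaves, made-up payoffs); it is not the Minimax-ABMO agent evaluated in this paper, and it reorders the sub-trees by hand rather than with a breadth search.

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, stats=None):
    """Return the minimax value of `node`, skipping branches that cannot matter.

    Nodes are numbers (leaf payoffs to player I) or lists of children, an
    encoding assumed for this sketch. `stats["leaves"]` counts evaluated leaves.
    """
    if not isinstance(node, list):
        if stats is not None:
            stats["leaves"] += 1
        return node
    value = -math.inf if maximizing else math.inf
    for child in node:
        child_value = alphabeta(child, not maximizing, alpha, beta, stats)
        if maximizing:
            value = max(value, child_value)
            alpha = max(alpha, value)
        else:
            value = min(value, child_value)
            beta = min(beta, value)
        if alpha >= beta:
            break          # remaining siblings cannot change the result
    return value

good_order = [[3, 5], [2, 9]]   # strongest move for player I searched first
poor_order = [[2, 9], [3, 5]]   # same sub-trees, weaker one searched first
good_stats, poor_stats = {"leaves": 0}, {"leaves": 0}
good_value = alphabeta(good_order, True, stats=good_stats)
poor_value = alphabeta(poor_order, True, stats=poor_stats)
```

Both orders return the same value, 3, but the good order evaluates 3 leaves while the poor order evaluates all 4. On deeper trees this gap compounds, which is the motivation for ordering moves before searching them.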

MCTS. MCTS aims to solve the multi-armed bandit decision problem, where an agent must choose between $A$ actions to maximise cumulative reward. MCTS does this by randomly sampling the decision space and then iteratively building a search tree outward. The algorithm, therefore, assumes that the true action value can be approximated by random simulation and that these approximations can be used to adjust the decision policy [13]. Rather than use a reward function, MCTS estimates the action value using an average utility of reward over several iterations of finished random games [9]. There are four main stages which operate on the game tree in this order: selection, expansion, simulation, and backpropagation [9].

Selection requires a selection policy to balance exploration and exploitation. Upper Confidence Bounds for Trees (UCT) applies the Upper Confidence Bounds 1 (UCB1) selection policy [17]:

$$UCT=\frac{v_{j}}{n_{j}}+2C_{p}\sqrt{\frac{2\ln N}{n_{j}}}, \tag{5}$$

where $j$ denotes node $j$ in the game tree; $v_{j}$ is the value of the outcome from the simulation phase; $n_{j}$ is the number of times that node $j$ has been visited by the algorithm; $N$ is the number of parent node visits; and $C_{p}$ is a parameter that controls the exploration limit [13]. Equation (5) calculates the upper confidence bound that a move is optimal [13]. $\frac{v_{j}}{n_{j}}$ is the average utility, representing exploitation. As $\frac{v_{j}}{n_{j}}$ increases, so does the UCT score for the node, meaning it is more likely to be selected. The right-hand term represents the exploration of the solution space. $C_{p}$ is called the exploration term, with larger values representing the algorithm's tendency to prefer exploration over exploitation [9], and therefore equation (5) addresses the exploration-exploitation dilemma. $C_{p}$ was set to the theoretical value of $C_{p}=\frac{1}{\sqrt{2}}$, which has been shown to satisfy Hoeffding’s Inequality [17], representing an optimal choice for zero-sum games such as Connect-4 [13].
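Equation (5) translates almost directly into code. The sketch below is illustrative: the function name `uct_score`, the treatment of unvisited nodes (scored as infinity so that they are tried first) and the example visit counts are assumptions.

```python
import math

def uct_score(v_j, n_j, N, c_p=1 / math.sqrt(2)):
    """Equation (5): average utility plus an exploration bonus."""
    if n_j == 0:
        return math.inf    # assumption: unvisited children are tried first
    return v_j / n_j + 2 * c_p * math.sqrt(2 * math.log(N) / n_j)

# Two hypothetical children of a parent visited 100 times.
well_explored = uct_score(v_j=60, n_j=90, N=100)
rarely_visited = uct_score(v_j=5, n_j=10, N=100)
```

Although the first child has the higher average utility (about 0.67 against 0.5), the second receives the higher UCT score because it has been visited far less often, so it would be selected next. With $C_{p}=\frac{1}{\sqrt{2}}$ the exploration term simplifies to $2\sqrt{\ln N / n_{j}}$.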

Starting at the root node, (5) is evaluated for each child node and the child with the highest UCT value is selected. If the child is a leaf node, then this is the final selection. If the child is not a leaf node, then the child of that node with the highest UCT value is selected in turn. This process is repeated until a leaf node is reached [9]. If the leaf node has been visited by the algorithm before, then new children are added to the selected node in the game tree during the Expansion stage [9].

Simulation is applied from the selected leaf node if Expansion does not occur and applied to the first child of the leaf node if Expansion does occur [13]. Simulation is a play-out of the entire game until a terminal state is reached, choosing moves for both player I and II with uniform distribution (random play). New nodes are not added to the game tree at this stage of the algorithm [9].
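The Simulation stage can be sketched as a random play-out on a Connect-4 board. The board encoding (a 6x7 list of lists with 0 for empty and 1 or 2 for the players), the helper names, and the return value 0 for a full board without a winner are assumptions of this example, not details given in the paper.

```python
import random

ROWS, COLS = 6, 7  # standard Connect-4 board size


def legal_moves(board):
    """Columns whose top cell is still empty."""
    return [c for c in range(COLS) if board[0][c] == 0]


def drop(board, col, player):
    """Return a new board with the player's disc dropped into col."""
    new = [row[:] for row in board]
    for r in range(ROWS - 1, -1, -1):  # lowest free cell first
        if new[r][col] == 0:
            new[r][col] = player
            return new
    raise ValueError("column is full")


def winner(board):
    """Return 1 or 2 if that player has four in a row, otherwise 0."""
    for r in range(ROWS):
        for c in range(COLS):
            p = board[r][c]
            if p == 0:
                continue
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + k * dr, c + k * dc) for k in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == p
                       for rr, cc in cells):
                    return p
    return 0


def simulate(board, player, rng):
    """Play uniformly random moves for both players until the game ends."""
    while True:
        w = winner(board)
        if w:
            return w
        moves = legal_moves(board)
        if not moves:
            return 0  # assumption: a full board without a winner is reported as 0
        board = drop(board, rng.choice(moves), player)
        player = 3 - player  # switch between player 1 and player 2


empty_board = [[0] * COLS for _ in range(ROWS)]
outcome = simulate(empty_board, player=1, rng=random.Random(0))
```

No nodes are created during the play-out; only the final outcome is returned, which matches the description that the game tree is not extended at this stage.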

The outcome (win/loss, as Connect-4 is a zero-sum game) from Simulation is recorded for use in the Backpropagation stage. This stage updates $v_{j}$ and $n_{j}$ at each node from the simulated node all the way back to the root node [9]. According to equation 5, when wins are backpropagated, $\frac{v_{j}}{n_{j}}$ increases, meaning that exploitation of the set of actions that lead to a win also increases: states that lead to a win for the agent are more likely to be re-selected in the search process, and states that lead to a loss are less likely.
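A minimal sketch of the Backpropagation stage is given below. The `Node` class and the choice to accumulate outcomes in `v` (so that `v / n` is an average utility) are assumptions of this example; the result is taken from the agent's point of view, with 1.0 for a win and 0.0 for a loss.

```python
class Node:
    """Minimal tree node holding the statistics used in (5)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.v = 0.0  # accumulated simulation outcomes (v_j)
        self.n = 0    # visit count (n_j)


def backpropagate(node, result):
    """Add the result and one visit to every node from `node` up to the root."""
    while node is not None:
        node.v += result
        node.n += 1
        node = node.parent


root = Node()
child = Node(parent=root)
leaf = Node(parent=child)
backpropagate(leaf, 1.0)  # a simulated win
backpropagate(leaf, 0.0)  # a simulated loss
print(leaf.v / leaf.n)    # 0.5, the average utility after one win and one loss
```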

The above four steps are repeated until the time budget, denoted $maxTime$, expires. At this point, the child node with the highest $\frac{v_{j}}{n_{j}}$ is selected and the move leading to this state is used as the decision of the algorithm [9]. $maxTime$ has a positive relationship with the strength of the decision [13]. The MCTS algorithm is provided in Algorithm 3.

Algorithm 3 Monte-Carlo Tree Search
Input: State $s$, $maxTime$, $C_{p}$
Output: $action$
$rootNode$ = Function TreeNode($n_{j}=0$, $v_{j}=0$)
While $timeElapsed < maxTime$:
    $node$ = $rootNode$
    While $node$ is not leaf:
        $node$ = Function Selection($node$, $C_{p}$)
        $s$ = Function UpdateState($s$, $node$.move)
    $node$.children = Function Expansion($s$)
    While $s$ is not terminal:
        $s$ = Function Simulation($s$)
    $result$ = Function EvaluateWinLoss($s$)
    Function Backpropagation($node$):
        While $node$ has parent:
            $v_{j}$, $n_{j}$ = $result$, $n_{j}+1$
            $node$ = $node$.parent
return $action$ = Function BestMove($rootNode$)
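To make the four stages concrete, the following self-contained Python sketch runs an MCTS loop of this shape on a deliberately tiny game rather than Connect-4. Several choices here are assumptions for illustration only: the game is a take-away game (players alternately remove 1 to 3 stones, and whoever takes the last stone wins), the budget is a fixed number of iterations instead of $maxTime$ so that runs are reproducible, outcomes are accumulated in `v`, and each node is credited from the point of view of the player who moved into it.

```python
import math
import random


class Node:
    def __init__(self, pile, to_move, parent=None, move=None):
        self.pile, self.to_move = pile, to_move  # stones left, player to move (1 or 2)
        self.parent, self.move = parent, move    # move = stones taken to reach this node
        self.children = []
        self.v, self.n = 0.0, 0                  # accumulated wins and visit count


def uct(child, c_p):
    if child.n == 0:
        return math.inf  # assumption: unvisited children are tried first
    explore = 2 * c_p * math.sqrt(2 * math.log(child.parent.n) / child.n)
    return child.v / child.n + explore


def mcts(pile, to_move, iterations=2000, c_p=1 / math.sqrt(2), seed=0):
    """Return the number of stones to take from a pile of at least one stone."""
    rng = random.Random(seed)
    root = Node(pile, to_move)
    for _ in range(iterations):
        node = root
        # Selection: descend by highest UCT value until a leaf is reached.
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, c_p))
        # Expansion: add one child per legal move, then step into the first child.
        if node.pile > 0:
            node.children = [Node(node.pile - k, 3 - node.to_move, node, k)
                             for k in (1, 2, 3) if k <= node.pile]
            node = node.children[0]
        # Simulation: uniformly random play until no stones remain.
        pile_left, mover = node.pile, node.to_move
        last_mover = 3 - node.to_move  # the player whose move produced this node
        while pile_left > 0:
            pile_left -= rng.randint(1, min(3, pile_left))
            last_mover, mover = mover, 3 - mover
        # Backpropagation: credit a node when the player who moved into it won.
        while node is not None:
            node.n += 1
            if last_mover == 3 - node.to_move:
                node.v += 1.0
            node = node.parent
    # Final decision: the root child with the highest average utility.
    best = max(root.children, key=lambda ch: ch.v / ch.n if ch.n else 0.0)
    return best.move


chosen = mcts(pile=5, to_move=1)
```

In this toy game, leaving the opponent a multiple of four stones is the winning strategy, so from a pile of five the search is expected to favour taking a single stone.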

Continuing with the first contribution of the paper, the base MCTS algorithm was enhanced with a modification called Decisive Moves [18], which alters the stochastic nature of MCTS to avoid situations where good moves are overlooked [9], and the enhanced algorithm was compared to the base algorithm. Decisive Moves places game-deciding moves under the control of a heuristic, which ensures the algorithm takes any winning move it finds. If a winning move is not available but the opposing player could win on their next move, the heuristic ensures the opponent's move is blocked. If neither of these conditions is satisfied, then the UCT algorithm runs as normal. This enhancement has been shown to improve the strength of the algorithm at little additional computational cost compared to the reference UCT method [18]. In line with this, we found that the MCTS-Decisive algorithm won 60% of games against the base MCTS algorithm, showing the increased playing strength of the algorithm, an intuitive result because only the weaknesses, not the strengths, of MCTS are modified. As far as we are aware, this is the first comparison between the Decisive Moves enhancement and MCTS in the domain of Connect-4.
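The heuristic described above can be sketched as a check that runs before the tree search. The board encoding (6x7 list of lists, 0 for empty, 1 or 2 for the players) and all function names are assumptions of this example; returning `None` stands for "fall back to the normal UCT search".

```python
ROWS, COLS = 6, 7  # standard Connect-4 board size


def legal_moves(board):
    """Columns whose top cell is still empty."""
    return [c for c in range(COLS) if board[0][c] == 0]


def drop(board, col, player):
    """Return a new board with the player's disc dropped into col."""
    new = [row[:] for row in board]
    for r in range(ROWS - 1, -1, -1):
        if new[r][col] == 0:
            new[r][col] = player
            return new
    raise ValueError("column is full")


def wins(board, player):
    """True if the player has four discs in a row in any direction."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                if all(0 <= r + k * dr < ROWS and 0 <= c + k * dc < COLS
                       and board[r + k * dr][c + k * dc] == player
                       for k in range(4)):
                    return True
    return False


def decisive_move(board, player):
    """Return a winning column, else a blocking column, else None."""
    opponent = 3 - player
    moves = legal_moves(board)
    for col in moves:  # 1) take an immediate win if one exists
        if wins(drop(board, col, player), player):
            return col
    for col in moves:  # 2) otherwise block the opponent's immediate win
        if wins(drop(board, col, opponent), opponent):
            return col
    return None        # 3) no decisive move: the caller runs UCT as normal


board = [[0] * COLS for _ in range(ROWS)]
board[5][0] = board[5][1] = board[5][2] = 1  # player 1 threatens column 3
board[5][6] = board[4][6] = board[3][6] = 2  # player 2 threatens column 6
print(decisive_move(board, 1), decisive_move(board, 2))  # 3 6
```

Both players have an immediate win in this position, so each takes their own winning column rather than blocking, which reflects the priority order of the heuristic.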

III Evolutionary Tournament And Analysis

The second and third contributions of this paper are presented in this section. The second contribution is contained within the Evolutionary Tournament Evaluation subsection and offers a unique evolutionary process to analyse performance. The third contribution is an extensive analysis and is presented in the Algorithms Vs. Control Algorithms and Relative Agent Evaluation subsections. Previous analyses have concentrated on either the relative performance of algorithms when playing against each other [8], performance against common adversaries [7], or move times [6], making comparison difficult due to the varying conditions. We use all of these approaches in one consistent setting, thereby offering a wider level of analysis in a single place. For the remainder of this paper, an algorithm class is defined as an algorithm type (Q-Learning, Minimax, MCTS) and each algorithm is denoted as the base class followed by enhancements and algorithm experience. For example, Minimax-ABMO-3 is Minimax with Alpha-Beta pruning, Move Ordering and $d = 3$. Q-Learning (Minimax) will be simplified to Q-Learning.

Evolutionary Tournament Evaluation. Axelrod's tournaments [12] were round-robin tournaments evaluating strategies for the Iterated Prisoner's Dilemma. These tournaments gave researchers a better understanding of which strategies are successful and of the evolution of cooperation [19], and serve as the inspiration for using evolutionary tournaments here.

The work in [19] extends the tournament analysis by investigating the evolutionary mechanisms of a strategy demographic within a population to find the existence, nature, and convergence of the strategy distribution. We consider our agents akin to sub-species in a population, allocating each agent a species count of ten. Simple evolutionary rules were implemented to simulate the evolutionary process over time. Two agents were randomly selected from the population (Tournament Selection in [19]) to play a game of Connect-4. Each game of Connect-4 represents a generation with limited resources over which two species fight. If an agent wins, it reproduces (its species count increases by one) to simulate domination over the other agent. The loser of the game dies and its species count decreases by one. While [19] uses scoring in their tournament, a 'death penalty' is suitable in Connect-4 because of the zero-sum nature of the game. In generation 1, players are selected with a uniform distribution because all species populations are equal. Over time, dominating species win more games and reproduce, giving them a higher chance of selection. This is called Selection Pressure, which drives the evolutionary process to favour stronger agents [20]. It also acts as a robustness test because stronger algorithms are chosen more often and therefore must consistently win their games of Connect-4 against others.
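The evolutionary rules above can be sketched in a few lines of Python. Several details are assumptions of this example because the text does not fix them: the two players are always drawn from different species, selection is proportional to the current species counts, a drawn game leaves the population unchanged, and the game itself is abstracted into a hypothetical `play` callback instead of an actual Connect-4 match.

```python
import random


def evolutionary_tournament(agents, play, start_count=10, max_generations=10_000, seed=0):
    """Run the win-reproduces / loss-dies process and return the final species counts."""
    rng = random.Random(seed)
    counts = {name: start_count for name in agents}
    for _ in range(max_generations):
        alive = [a for a in counts if counts[a] > 0]
        if len(alive) < 2:
            break  # a single species holds the whole population
        # Selection pressure: larger species are more likely to be drawn.
        first = rng.choices(alive, weights=[counts[a] for a in alive])[0]
        rest = [a for a in alive if a != first]
        second = rng.choices(rest, weights=[counts[a] for a in rest])[0]
        winner = play(first, second, rng)  # returns first, second, or None for a draw
        if winner is None:
            continue
        loser = second if winner == first else first
        counts[winner] += 1  # the winner reproduces
        counts[loser] -= 1   # the loser dies
    return counts


# Hypothetical strengths, used only to drive this illustration.
strength = {"A": 3, "B": 2, "C": 1}


def stronger_wins(first, second, rng):
    """Toy game: the species with the higher strength always wins."""
    return first if strength[first] > strength[second] else second


counts = evolutionary_tournament(strength, stronger_wins)
print(counts)  # {'A': 30, 'B': 0, 'C': 0}
```

Because every decisive game moves one individual from the loser to the winner, the total population stays constant, and a species that never loses must end with a 100% share.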

Two tournaments are performed: an elite tournament (see Fig. 1), consisting of just the enhanced variations and maximum experience levels in each class (MCTS-Decisive-5, Minimax-ABMO-6 and Q-Learning (100k)), and a grand tournament (see Figs. 2-3), containing 28 algorithms from every algorithm type and experience level. Experience was defined as games trained for Q-Learning, depth for Minimax, and seconds for MCTS, as described in section II. Both tournaments included three standardised control algorithms. Evaluating against controls provides a standard metric to measure performance across the literature. Three controls were selected as opponents, with their relative strength assessed in a multi-game round-robin style fixture. In ascending strength, these are: Random, which picks an action with a uniform probability distribution, see, e.g., [11, 5]; Supervised, a semi-weak algorithm inspired by [3]; and Heuristic, a rule-based algorithm which chooses the best move based on simple rules such as the reward/payoff functions defined in section II.

Figure 1: Elite Evolutionary Tournament.
Figure 2: Grand Evolutionary Tournament.
Figure 3: Grand Evolutionary Tournament: MCTS Class.

Vertical dotted lines on each figure mark the generation at which a particular species was eliminated from the population. Flat solid lines represent no change in population share, rising solid lines represent increasing population share and falling solid lines represent declining population share. The elite tournament reached generation 94, while the grand tournament reached generation 2,151 before there was evidence of population share stability.

In the elite tournament, MCTS-Decisive-5 was the strongest performing algorithm, eliminating all other species and finishing with a 100% population share. A comparable notion in evolutionary game theory is referred to as an Evolutionarily Stable Strategy (ESS), which extends the concept of Nash equilibrium to a situation where the current population is immune to invasions [16]. This result also shows the robustness of MCTS-Decisive-5, which consistently beat the other algorithms, as in the case after generation 83 where the trends of Minimax-ABMO-6 and MCTS-Decisive-5 were inversely associated. Between generations 76-80 and 82-94 we notice a steady population increase for MCTS-Decisive-5. Five significant events occur over these two periods: the eliminations of the five other algorithm species. The selection pressure of MCTS-Decisive-5 increased as it started to dominate, and it was repeatedly selected to play against the other algorithms. The second-placed algorithm was Heuristic (eliminated at generation 93), third was Minimax-ABMO-6 (generation 91), fourth was Supervised (generation 85), fifth was Random (generation 77) and sixth was Q-Learning (100k) (generation 76). While Minimax-ABMO-6 was placed third, the results indicate that it was the second strongest algorithm because its population share was largely above its original share of 16% and many of its upward trends correspond to the downward trends of other algorithms, indicating its strength (for example, at generation 30 against MCTS-Decisive-5). Q-Learning (100k) was the worst performing algorithm, being eliminated before any of the controls. Based on previous evaluation we expected Q-Learning (100k) to outlast the Random and possibly the Supervised control, but the stochastic nature of the selection mechanism contributed to its early elimination: for example, in generations 6-8 and 25-28 Q-Learning (100k) was drawn against the stronger opponents, MCTS-Decisive-5 and Minimax-ABMO-6, rather than the weaker control algorithms.

The grand tournament showed the domination of the MCTS and MCTS-Decisive algorithm classes, all of which (except MCTS-Decisive-2) remained at the end of the tournament, unlike the other 21 algorithms, which were eliminated. The population of these classes began to stabilize after the last non-dominant class was eliminated and showed signs of fluctuation around a stationary point between generations 1,700 and 2,151, and therefore the tournament was stopped. Any small fluctuations are likely due to randomness in being selected as the first or second player, where the former has the advantage of the first move. When one algorithm does not capture 100% of the population share, we cannot definitively say that one algorithm was the strongest, due to the stochastic nature of the selection. However, the evolutionary mechanism does clearly show the domination of the MCTS type class. The surviving algorithms were grouped into three sets around a population percentage representing the relative strength of each group. Position one consisted of MCTS-4, MCTS-Decisive-4, and MCTS-Decisive-5; position two consisted of MCTS-5 and MCTS-Decisive-3; position three consisted of MCTS-2 and MCTS-3. The other algorithms' positions were Minimax-AB (generation 1300), Minimax-MO (generation 1075), Minimax-ABMO (generation 1050), Q-Learning (generation 850), Minimax (generation 625), and the controls (generation 450). MCTS-Decisive-2 was theoretically one of the weakest of the dominating MCTS type classes due to its low move time and therefore its elimination was not surprising. Minimax enhancements performed better individually (Minimax-AB, Minimax-MO) than when combined in one algorithm (Minimax-ABMO). Q-Learning performed better than the Minimax base algorithm, similar to the results in [8]. These results show a correlation between algorithm placement and move time and a clear positive impact of the Decisive Moves enhancement [18].

Algorithms Vs. Control Algorithms. The best performing agent in each algorithm class, as defined in section II, was evaluated against the three control algorithms defined in Evolutionary Tournament Evaluation. Each algorithm class was assessed at differing experience levels against the control algorithms. Evaluation was conducted across two metrics: move timing, which is important for real-time algorithm applicability [10], and win rate (%) [3, 11, 5]. Each algorithm-experience combination played 50 games against each control, with a 50-50 split between player I and II, to assess the strength of each algorithm with differing experience. Figure 4 presents results on average move timing as well as win percentages for each of the algorithms against the control algorithms.

Figure 4: Algorithm Vs. Control.
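One way to collect the move-timing metric is to wrap an agent so that every decision is timed. This is only a sketch of such a measurement, assuming an agent is a callable that maps a state to a move; the names `timed_agent` and `mean_move_time` are hypothetical and do not come from the paper.

```python
import time


def timed_agent(agent):
    """Wrap an agent (a callable state -> move) and record the duration of each call."""
    durations = []

    def wrapped(state):
        start = time.perf_counter()
        move = agent(state)
        durations.append(time.perf_counter() - start)
        return move

    return wrapped, durations


def mean_move_time(durations):
    """Average decision time in seconds, or 0.0 if no move was made."""
    return sum(durations) / len(durations) if durations else 0.0


# Stand-in agent for illustration: always plays the lowest-numbered legal column.
agent, durations = timed_agent(lambda legal_columns: min(legal_columns))
agent([3, 4, 5])
agent([0, 6])
print(len(durations))  # 2
```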

The Q-Learning algorithm class has the fastest mean move time, with a high of 0.12s for Q-Learning (100k) and a low of 0.06s for Q-Learning (80k). Games trained and mean move time were positively correlated due to the increased size of the Q-Table as game experience increases. For Minimax-ABMO, the mean move time and depth are positively correlated, with a high of 4.26s for Minimax-ABMO-5 and a low of 0.45s for Minimax-ABMO-4. This is in line with results reported in [6], where a Minimax Alpha-Beta agent with $d=4$ took 0.006s to make a move. MCTS mean move time and computation time are also positively correlated, with a high of 4.45s for MCTS-Decisive-5 and a low of 1.58s for MCTS-Decisive-2. With the exception of MCTS-Decisive-3, the mean move time of each Decisive Moves enhanced algorithm is below its computation time, showing a more efficient outcome than the standard MCTS algorithm. This is because, when the heuristic is activated, it controls the move decision instead of the MCTS process, resulting in a lower mean move time.

Q-Learning is the quickest class of algorithms, with each variation taking less than 0.13s mean move time. Minimax was slightly better than MCTS in terms of speed, with the upper limits of the two algorithm classes being 4.26s and 4.45s, respectively. It is expected that the mean move time of Minimax would increase exponentially as the depth parameter increases [6], unlike MCTS, as outlined in section II. An adequate timing convention is currently lacking in the literature and therefore the importance of these results is defined by the application. Some applications may require quicker moves, while others could be more forgiving.

The win rates of all three algorithm classes increase with experience, showing the learning success of each algorithm at playing Connect-4. Against the Random control, all MCTS agents have a win rate of 100%, Minimax has a mean win rate of 99.5% due to Minimax-ABMO-3, and Q-Learning has a mean win rate of 75.5%. Against the Supervised control, all MCTS algorithm types have a win rate of 100%, Minimax has a mean win rate of 87.5% due to Minimax-ABMO-3, and Q-Learning has a mean win rate of 69.5%. Minimax-ABMO-3 performs the worst out of all 12 algorithms against the Supervised control, indicating that a depth of 3 is too shallow for Minimax to play Connect-4 effectively. Against the Heuristic control, MCTS has a mean win rate of 94.75%. Minimax has a mean win rate of 75% due to the irregularity of performance across different depths: $d=3$ and $d=5$ (odd depths) win 50% of games while $d=4$ and $d=6$ (even depths) win 100% of their games. This is contrary to the expectation that deeper searches produce a stronger performance [6] but is in line with results in [8], where a $d=2$ Minimax agent outperformed a $d=3$ agent against a constant adversary. The fact that these results are repeated suggests that, in certain contexts, a deeper search does not always translate into a stronger performance. It is possible that odd depths cause the evaluation of the terminal state to become unbalanced, as the opposing player moved last, causing a lower max score compared to even depths. Similarly, [8] explains their results as the exploration of sub-optimal game trees caused by the odd search depth. Further research over a larger range of depths is needed for any solid conclusions.

The Q-Learning agent performed poorly against the Heuristic agent, with a mean win rate of just 1%. This is likely because Q-Learning plays randomly in states it has not experienced yet [5]. Due to the superiority of the Heuristic control compared to random play, it is likely that states not yet experienced were encountered frequently, and therefore Q-Learning under-performed relative to its results against the other control algorithms.

Relative Agent Evaluation. Each algorithm (selected based on enhancement and experience from the previous evaluation) played a series of games against the others to assess their relative strengths. Rounds of 100 games were played between every combination of algorithms. These were split 50-50 between player I and II so that each algorithm plays in both the first and second position. Table I shows the mean win rate and mean move time for each combination, aggregated over its games.

TABLE I: Algorithms vs algorithms.
| Algorithm | Vs. | Win Rate (%) | Mean Move Time (s) |
| --- | --- | --- | --- |
| MCTS-Decisive-5 | Minimax-ABMO-6 | 69 | 5.69 |
| MCTS-Decisive-5 | Q-Learning (100k) | 100 | 2.12 |
| Minimax-ABMO-6 | MCTS-Decisive-5 | 27 | 3.4 |
| Minimax-ABMO-6 | Q-Learning (100k) | 100 | 4.25 |
| Q-Learning (100k) | MCTS-Decisive-5 | 0 | 0.11 |
| Q-Learning (100k) | Minimax-ABMO-6 | 0 | 0.05 |
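The head-to-head protocol described above (a fixed number of games, split evenly between the first and second seat) can be sketched as follows. The `play_game` callback and its return convention (1 if the first player won, 2 if the second player won, 0 for a draw) are assumptions of this example rather than the interface used in the paper.

```python
def head_to_head(agent_a, agent_b, play_game, games=100):
    """Alternate seats over `games` games; return agent A's (win %, draw %)."""
    wins = draws = 0
    for g in range(games):
        a_first = g % 2 == 0  # even-numbered games: A is player I
        first, second = (agent_a, agent_b) if a_first else (agent_b, agent_a)
        result = play_game(first, second)  # 1: first won, 2: second won, 0: draw
        if result == 0:
            draws += 1
        elif (result == 1) == a_first:
            wins += 1  # A won from whichever seat it occupied
    return 100 * wins / games, 100 * draws / games


# Toy game for illustration: whoever moves first always wins.
print(head_to_head("A", "B", lambda first, second: 1))  # (50.0, 0.0)
```

Splitting the seats evenly removes the first-move advantage from the comparison, which is why the toy game above yields exactly 50% for either agent.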

Q-Learning (100k) was the weakest in this evaluation, losing every game against Minimax-ABMO-6 and MCTS-Decisive-5. On the other hand, the algorithm had the lowest mean move time, taking just 0.05s and 0.1s per move on average against Minimax-ABMO-6 and MCTS-Decisive-5, respectively. While this is encouraging for applications that require a low mean move time, the performance is so poor that it likely will not represent a viable solution to the game. Minimax-ABMO-6 was the second-best algorithm, winning 100% of games against Q-Learning (100k) and 27% of its games against MCTS-Decisive-5. It had a mean move time of 3.4s against MCTS-Decisive-5 and 4.3s against Q-Learning (100k). The algorithm's mean move time is lower than that of MCTS-Decisive-5 but higher than that of Q-Learning (100k), making the algorithm the middle ground for both performance and mean move time. Interestingly, the results show the algorithm is quicker against the stronger opponent (based on win rate). One possible explanation is that a tougher opponent would play a longer game and a weaker opponent would play a shorter game. Later game moves would require fewer calculations because there would be fewer remaining moves to search, which would reduce mean move time. In contrast, early game moves would cause a higher mean move time as Minimax evaluates more possible moves to its search depth [9].

MCTS-Decisive-5 was the best performing algorithm, winning 100% of games against Q-Learning (100k) and 69% of games against Minimax-ABMO-6. It had a mean move time of 5.7s against Minimax-ABMO-6 and 4.1s against Q-Learning (100k). This time difference is likely due to the relative strength of the two opponents. Better moves by the opponent would block more game-deciding moves, resulting in fewer activations of the Decisive Moves enhancement heuristic [18] and therefore a higher mean move time.

IV Conclusion

In this paper, we have presented a comparative and systematic analysis of Q-Learning, Minimax and MCTS. After formally introducing each of these algorithms, we have thoroughly compared their strengths and weaknesses on a single domain, namely, Connect-4. This has allowed us to directly assess the strengths and weaknesses of the algorithms in a common setting with identical branching factor. We have carried out this comparison through several approaches and introduced a novel method, the Evolutionary Tournament. Finally, we have shown that MCTS-Decisive is the strongest algorithm class in terms of win rate. This was supported by similar results in [8]. Q-Learning came first in terms of move speed, but its win performance was poor, likely due to a lack of training games compared to other implementations in the literature [15, 8]. Minimax held the middle ground in terms of both win rate and mean move speed. Future directions of work include: i) enhancing the rules of the Evolutionary Tournament for comparative contexts, enabling more accurate placement of non-dominant algorithms (see [19] for a score-based approach), ii) expanding the mathematical modeling of the tournament to evolutionary dynamics or discrete Markov chains to model the growth/decay of algorithm species, and iii) investigating the impact of odd and even Minimax depths for contexts such as Connect-4, as found in our paper and in [8].

References

  • [1] D. Silver, A. Huang, C. J. Maddison, et al., “Mastering the game of Go with deep neural networks and tree search”, Nature, vol. 529, no. 7587, pp. 484–489, 2016.
  • [2] X. Kang, Y. Wang and Y. Hu, “Research on different heuristics for Minimax algorithm insight from connect-4 game”, Journal of Intelligent Learning Systems and Applications, vol. 11, pp. 15–31, 2019.
  • [3] M. Schneider and J. Garcia Rosa, “Neural connect 4 - a connectionist approach to the game”, Brazilian Symposium on Neural Networks, 2002, pp. 236–241.
  • [4] H. van den Herik, J. W. Uiterwijk and J. van Rijswijck, “Games solved: Now and in the future”, Artificial Intelligence, vol. 134, no. 1, pp. 277–311, 2002.
  • [5] H. Wang, M. Emmerich and A. Plaat, “Assessing the potential of classical q-learning in general game playing”, Communications in Computer and Information Science, vol. 1021, pp. 138–150, 2018.
  • [6] R. Nasa, R. Didwania, S. Maji and V. Kumar, “Alpha-beta pruning in mini-max algorithm – an optimized approach for a connect-4 game”, International Research Journal of Engineering and Technology (IRJET), vol. 5, 2018.
  • [7] J. Scheiermann and W. Konen, “AlphaZero-inspired game learning: Faster training by using MCTS only at test time”, IEEE Transactions on Games, pp. 1–11, 2022.
  • [8] M. Dabas, N. Dahiya and P. Pushparaj, “Solving connect 4 using artificial intelligence”, International Conference on Innovative Computing and Communications, Singapore: Springer Singapore, 2022, pp. 727–735.
  • [9] S. Russell and P. Norvig, Artificial intelligence: A Modern Approach. 4th ed. Harlow, Essex, UK: Pearson, 2021.
  • [10] E. R. Escandon and J. Campion, “Minimax checkers playing GUI: A foundation for AI applications”, IEEE XXV International Conference on Electronics, Electrical Engineering and Computing (INTERCON), 2018, pp. 1–4.
  • [11] J. Persson and T. Jakobsson, “Self-learning game player – connect-4 with q-learning”, Bachelor’s dissertation, KTH Royal Institute of Technology, Stockholm, Sweden, 2011.
  • [12] R. Axelrod and W. D. Hamilton, “The Evolution of Cooperation”, Science, vol. 211, no. 4489, pp. 1390–1396, 1981.
  • [13] C. Browne, E. Powley, D. Whitehouse, et al., “A survey of monte carlo tree search methods”, IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012.
  • [14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. 2nd ed. Cambridge, Massachusetts, USA: MIT Press Ltd, 2018.
  • [15] O. Arvidsson and L. Wallgren, “Q-learning for a simple board game”, Bachelor’s dissertation, KTH Royal Institute of Technology, Stockholm, Sweden, 2010.
  • [16] M. Maschler, E. Solan and S. Zamir, Game Theory. Cambridge, Cambridgeshire, UK: Cambridge University Press, 2013.
  • [17] L. Kocsis and C. Szepesvári, “Bandit based monte-carlo planning”, Machine Learning: ECML 2006, Berlin, 2006, pp. 282–293.
  • [18] F. Teytaud and O. Teytaud, “On the huge benefit of decisive moves in monte-carlo tree search algorithm”, Proceedings of the 2010 IEEE Conference on Computational Intelligence and Games, 2010, pp. 359–364.
  • [19] S. Airiau, S. Saha and S. Sen, “Evolutionary Tournament-Based Comparison of Learning and Non-Learning Algorithms for Iterated Games”, Proceedings of the Eighteenth International Florida Artificial Intelligence, pp. 449–454, 2005.
  • [20] B. L. Miller and D. E. Goldberg, “Genetic algorithms, tournament selection, and the effects of noise”, Complex Systems, vol. 9, 1995.