More Victories, Less Cooperation: Assessing Cicero’s Diplomacy Play

Wichayaporn Wongkamjan1       Feng Gu1       Yanze Wang4       Ulf Hermjakob4
Jonathan May4       Brandon M. Stewart2       Jonathan K. Kummerfeld3
Denis Peskoff2       Jordan Lee Boyd-Graber1
1University of Maryland       2Princeton University       3University of Sydney
4Information Sciences Institute, University of Southern California
{wwongkam,fgu1}@umd.edu       {yanzewan,ulf,jonmay}@isi.edu
[email protected]    [email protected]   
[email protected]    [email protected]
Abstract

The boardgame Diplomacy is a challenging setting for communicative and cooperative artificial intelligence. The most prominent communicative Diplomacy ai, Cicero, has excellent strategic abilities, exceeding human players. However, the best Diplomacy players master communication, not just tactics, which is why the game has received attention as an ai challenge. This work seeks to understand the degree to which Cicero succeeds at communication. First, we annotate in-game communication with abstract meaning representation to separate in-game tactics from general language. Second, we run two dozen games with humans and Cicero, totaling over 200 human-player hours of competition. While ai can consistently outplay human players, ai–Human communication is still limited because of ai’s difficulty with deception and persuasion. This shows that Cicero relies on strategy and has not yet reached the full promise of communicative and cooperative ai.



1 Diplomacy Requires Communication

In a landmark paper, Bakhtin et al. (2022) introduce Cicero, an ai that plays the game Diplomacy. The Washington Post claims “the model is adept at negotiation and trickery” Verma (2022), Forbes asserts “Cicero was able to pass as a human player” (Porterfield, 2022), and even the scientific publication’s editor states ai “mastered Diplomacy” (Bakhtin et al., 2022). This work tests those popular perceptions to rigorously evaluate the communicative and strategic capabilities of Cicero. Our observations lead to insights about the current state of cooperation and communication in ai, highlighting its deceptive and persuasive characteristics.

While Cicero plays strategically and with a verisimilitude of human communication, the evaluation in Bakhtin et al. (2022) focuses only on if Cicero wins games. As the name implies, Diplomacy is revered by its devotees as a game of nuanced negotiation (Kraus and Lehmann, 1995), convincing persuasion, and judicious betrayal. We argue that mastering Diplomacy requires these communicative skills (Section 2). Measuring persuasion, deception, and cooperation are open problems with no clear solution. A boardgame constrains the world of actions to make these measurements feasible. One contribution of this paper is to build measurements of these communicative skills and to evaluate the true state of ai play in Diplomacy.

However, a technical challenge to identifying persuasion and deception is mapping from communication to in-game actions (e.g., verifying that I follow through on a promise of helping you): persuasion uses words to convince someone to do something, and deception is saying something to alter another’s belief Chisholm and Feehan (1977). In both cases, we map and contrast communication to agents’ actions.

To enable this mapping, we annotate messages from human games with abstract meaning representation (amr, Banarescu et al., 2013), which we use to train a grounded system to infer the goals of Diplomacy players (Section 3). After validating we can extract communicative intents, we use these representations to identify persuasion and deception (Section 4). We then remove some of Cicero’s communicative ability: this does not impair its ability to win games.

We then test Cicero’s skills in games against humans (Section 5) while asking players to annotate if they think other players are an ai and if a message is a lie. Confirming earlier work, Cicero wins nearly every game, including against top players. However, our new annotations provide a counterargument to the prevailing view that Cicero has mastered the communicative aspect of the game that is the priority of the nlp community. Cicero plays “differently”; humans can reliably identify Cicero and it is less deceptive and persuasive to human players. Communication from Cicero is more transactional, relying on its optimal strategy rather than the alliance building which is the hallmark of top human players. Cicero has yet to prove effective in the communication skills which are crucial to achieving goals in strategic games and real life.

In Section 7, we discuss what it would take for a computer to truly master Diplomacy and how Diplomacy’s intrinsic persuasion and deception can improve computers’ ability to not just talk like a human but to realistically tie words to actions.

2 Diplomacy as Human-AI Communication Test Bed

Diplomacy is a strategic board game that combines negotiation and strategy, where players take on the roles of various European powers (nations) on the eve of World War I. The essence of the game lies in forming and betraying alliances to control territories, requiring adept diplomacy (hence the name of the game) and strategic planning. Some players focus on aggressive tactical decisions, while others focus on making alliances, communicating, and collaborating with others for better outcomes Pulsipher (1982). The goal of the game is to capture territory, board regions called supply centers: once you capture enough of these supply centers, you win the game.

The charm and challenge of Diplomacy messages is that players are free to talk about anything, either strategy-related or not. Most messages in Diplomacy are about (1) past, current, and future turns, (2) alliance negotiations, and (3) acquiring and sharing third-party information, such as “Did Turkey talk to you?” or “I can’t believe you attacked Russia in SEV.” A few messages are (4) non-Diplomacy conversation, e.g., “How are you today?”

Figure 1: Our goal is to detect when players use persuasion and deception and compare human players to Cicero. First, we retrieve initial orders (left), then extract moves from natural language communication (middle) through amr  (Section 3), and later detect deception and persuasion (Section 4) conditioning initial intents and final orders (right). We show two possibilities: (top) Germany breaks its commitment to England by moving to Norway instead of Sweden, and (bottom) England successfully persuades Germany if Germany moves its unit to Sweden as England suggests and this move is not in Germany’s initial orders.

Communication in Diplomacy is attractive to academic researchers because it can be linguistically and cognitively complex, but is grounded in a constrained world with well-defined states and dynamics.¹ The Diplomacy ai Development Environment (Rose and Norman, 2002, daide) is a structured syntax for ai agents to play Diplomacy: create alliances, suggest moves, etc. Several agents have used this stripped-down communicative environment: Albert van Hal (2009), SillyNegoBot Polberg et al. (2011), DipBlue Ferreira et al. (2015), inter alia. Unlike these agents, Cicero Bakhtin et al. (2022) uses a large language model to enable free-form English communication, enabling “normal” play with humans. Cicero excelled in an online Diplomacy league, scoring over twice the average of human participants and ranking in the top 10% among those who played multiple games. In the original work, it is unclear if Cicero’s success is due to its use of natural language or its strategic model. In our Human–Cicero studies, we ask that participants annotate messages they perceive as deceptive, allowing us to more carefully study the communicative aspects of the game (more details are in Section 5).

¹ Diplomacy without communication, sometimes called “gunboat”, has been studied with rule-based, rl, and other approaches (as we detail in more depth in Appendix A). We limit gunboat discussion here as our focus is on the communicative aspect of the game.

2.1 Diplomacy without Communication is not Diplomacy

Having discussed the basics of Diplomacy, we now turn to what makes the game unique. Because the game is relatively balanced between seven players at the start, players need to form alliances if they hope to gain an advantage. However, these alliances should be mutually beneficial; from a player’s perspective, they need to advocate for cooperation that benefits themselves. This requires effective persuasion Cialdini (2000): making appeals to scarcity, reciprocity Kramár et al. (2022), unity, or shared norms. This is a communicative task which involves social and emotional skill: picking the right moves and convincing other players to help them.

However, the ultimate goal of Diplomacy is for individual players to win the game. This means that alliances will fall apart, leading to deception Peskov et al. (2020) as part of a betrayal Niculae et al. (2015). Because a player might benefit from a victim thinking that they are working together, a betrayer often sets up the tactical conditions for a betrayal while obfuscating their goals through cleverly composed deceptive messages (even if not outright lies).

Because deception and betrayal are communicative acts both necessary for mastering and enjoying Diplomacy and grounded in the state of the game, the next sections develop tools to detect when they happen. This will allow us to measure whether ai agents like Cicero have mastered both tactics and communication.

3 Grounding Messages Into Intentions

Consider this in-game statement made by England to Germany about a specific move-set (glosses added to locations):

You can steal STP [St. Petersburg] from Russia if you’re in SWE [Sweden] next turn. I will support you there.

We want to be able to tell if the speaker is lying (e.g., they’re going to do something else instead of what they claim they’re going to do) and if the speaker has convinced the recipient to alter their actions. This is necessary to measure how effectively Cicero communicates in the game.

While we know the intended actions of players when they submit their moves, we need to see how those moves match up to their communications in the discussion period before they submit moves (Figure 1). We use amr to build a machine-readable representation of the intent of actions in their communications. We are not starting from scratch: daide (Section 2) provides a set of predicates (ally, move, etc.) critical to Diplomacy communications. We thus focus on annotating these predicates that encode actions, allowing us to understand the communicative intent of messages, where speakers could say they will do something and follow through, or say they will do something and not follow through.

Because not all information needed for annotation is in the raw message text, we further show human annotators who wrote the text (e.g., France, Germany), seasons (e.g., Spring 1901), and the current game state. This information is necessary to annotate “You can steal STP from Russia if you are in SWE this turn. I will support you there” in the earlier example so that the annotators can assume what unit would support and what unit would move into Sweden. In this case, England’s fleet in Norway supports a German fleet in Skagerrak to move into Sweden.

3.1 Annotation

Like any specialized domain, Diplomacy has its unique vocabulary. Taking the above statement as an example, we extend the amr vocabulary to include not only abbreviations, such as “SWE” for Sweden, but also verbs like “threaten” and “demilitarize” (to set up a demilitarized zone), as well as concepts that describe actions like gaining, holding, or losing provinces, especially supply centers (“SC”), which are equivalent to points and integral to winning the game. In contrast to standard amr annotation, where every sentence is fully annotated, Diplomacy amr annotation sometimes involves only partial or no annotation for certain utterances, depending on their relevance to gameplay strategies like forming alliances or making moves, exemplified by amr concepts such as ally-01, move-01, and attack-01 (the full extended vocabulary is introduced in Appendix B.1).

In a preliminary annotation phase, we have Diplomacy experts annotate sentences from Peskov et al. (2020) to train human annotators and refine the Diplomacy Appendix of the amr Annotation Dictionary. We annotate 8,878 utterances (ranging from a word to several sentences). 4,412 of those utterances are annotated as empty amrs (e.g. for “Lemme think about your idea”) indicating no in-game move intent. 598 of the annotated amrs contain full information extracted from messages of Diplomacy games. The remaining 3,868 amrs contain 3,306 utterances with underspecified information such as units with missing type, location (Figure A1), and nationality (Figure A2), as well as 562 agreements with a missing object. Many utterances contain underspecified information, as Diplomacy players often communicate with messages that lack specific details (which are implicit and can be inferred from the game state). The annotated amr corpus is further used for training our English-to-amr parser to extract communicative intent information from utterances.

3.2 Training a Parser to Detect Intentions

We use a sequence-to-sequence model Jascob (2023) fine-tuned on the amr 3.0 dataset Knight et al. (2020) as the baseline model to detect communicative intent in new conversations. Following other amr work, we report parsing accuracy via the widely-used smatch score Cai and Knight (2013). We divide the annotated corpus into 5,054 training, 1,415 validation, and 2,354 test sentences, with the train / validation / test folds split by game (Peskov et al., 2020).

We improve the model, starting from a baseline version without fine-tuning with smatch of 22.8. Our domain-tuned model using Diplomacy-amr improves smatch by 39.1, to 61.9. Adding data augmentation into the model (e.g., knowing the sender of a message is England and the recipient is Germany) improves smatch to 64.6. Adding separate encodings for this information further improves smatch by 0.8 (65.4). Additionally, we apply data processing to replace (1) pronouns with country names and (2) provinces in abbreviations with full names, which increases smatch to 66.6 (More parser details in Appendix B.2). This parser enables us to evaluate the role that communication has in Cicero’s capabilities.

3.3 Does Cicero need to Talk to Itself to Win?

The first question is whether communication matters in games with other Cicero agents (deferring the question of competition with humans to later sections). We have Cicero variants with different levels of communication abilities—ranging from “gunboat” without any messages (Appendix A) to full Natural Language capabilities—play each other and evaluate the results.

For the seven Cicero variants in each game, we randomly select three to have communicative ability; the remaining four play communication-less “gunboat.” The selected three communicative powers have the same communication level. We define a set of communication levels, from more communicative to less communicative:

Natural Language: the Cicero agent of Bakhtin et al. (2022) with full natural language

amr: only messages about game actions (i.e. those that are parsed by amr) go through, allowing the agents to coordinate game actions (Appendix B.1)

Random: a random message from a corpus of previous Diplomacy games is sent,² mimicking form without content.

² We match the assigned power of the sender and receiver and the year, which makes the message slightly more convincing (but unlikely to be consistent with the game state).

Cicero plays 180 games with itself; 60 games for each communication level. In each game, we stop after 14 movement turns with 10 minutes of communication for each turn. We randomly select which power the agents are assigned to, so power distribution is balanced.
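The self-play setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the experiment code: the power abbreviations, the level labels, the function names `assign_game` and `build_schedule`, and the seeded random generator are all hypothetical choices made for the example.

```python
import random

# Assumed labels for the seven powers and the three communication levels.
POWERS = ["AUS", "ENG", "FRA", "GER", "ITA", "RUS", "TUR"]
LEVELS = ["natural_language", "amr", "random"]


def assign_game(level, rng):
    """Pick three powers to communicate at `level`; the other four play gunboat."""
    talkers = set(rng.sample(POWERS, 3))
    return {power: (level if power in talkers else "gunboat") for power in POWERS}


def build_schedule(games_per_level=60, seed=0):
    """Build one assignment per game: 60 games for each of the three levels."""
    rng = random.Random(seed)
    return [assign_game(level, rng)
            for level in LEVELS
            for _ in range(games_per_level)]


schedule = build_schedule()
print(len(schedule), schedule[0])
```

Drawing the three communicating powers uniformly at random in every game is what lets power assignment and communication level be separated when comparing supply center counts.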

We measure performance by the number of supply centers (and thus how well the agent played the game, Appendix B.3). Consistent with our hypothesis that performance is driven by tactics, the gains Cicero gets from communication are substantially smaller than the gains from playing a stronger power (Figure 2): Playing as France (fra) yields an expected 2.8 additional supply centers (2.0–3.6 95% interval) compared to the median power Russia (rus). In contrast, the best language condition, amr, yielded only an expected 0.2 additional supply centers (-0.5–0.9 95% interval). In other words, the effect of choosing the best power over the median power is 14 times larger than the effect of the best communication strategy. This is consistent with prior findings (Sharp, 1978) that France is the easiest power to play and our other findings that Cicero’s communicative ability plays no clear role in its win rate.

To better understand what Cicero is using communication for, we build on our amr representations to capture intent in the next section.

Figure 2: Power assignment is strongly predictive of Cicero performance as measured by supply center gains. Coefficients (with 95% confidence intervals) from a linear regression with Random Message/Russia as a baseline show that the effect of changing language systems is trivial compared to changing powers.

4 Promises Made, Promises Kept, and Finding Dirty Lies

In the example from the previous section (Figure 1), England says that it would support Germany’s move into Sweden from the Skagerrak Sea, while Germany agrees with the proposal from England. What does it mean for these to be deceptive or persuasive?

The geopolitical definition of deception is to manipulate adversaries’ perceptions to gain strategic advantage (Daniel and Herbig, 1982), e.g., Germany alters England’s belief so that England would do an action that benefits Germany (i.e. Germany has a better chance of ending up with higher score). However, evaluating deception is challenging because it requires estimating the differences in England’s beliefs before and after Germany deceives England. Therefore, we break down a broader, amorphous concept into easier-to-handle concepts and leave broader deception to future work. The first subconcept is breaking of a commitment (Kramár et al., 2022): someone saying they will do something and not following through. In the example, if Germany commits to moving to Sweden but later attacks England in Norway, this will be detected as a broken commitment.

The second subconcept is lying: human players in games with Cicero annotate their messages as either truthful or deceptive. We build on Peskov et al. (2020), who define deception to players thusly: “Typically, when [someone] lies [they] say what [they] know to be false in an attempt to deceive the listener.” Again, deception is broader than lying (and there are top Diplomacy players who intentionally deceive while never outright saying anything untrue), and our definition of deception is slightly broader than Peskov et al. (2020). In our more permissive annotations, human players mark interactions that are broken commitments, lies about other players, hedging about the state of alliances, and anything else that they feel is deceptive. Despite this ontological uncertainty, for convenience, we refer to the process of humans annotating messages as “human lie annotations” for consistency with Peskov et al. (2020). To recap: a broken commitment can be a lie and an example of deception.³ Likewise, not all lies are broken commitments, but both breaking a commitment and lying are facets of deception. While we cannot capture all deception (because it is based on internal state), it is important to capture as much as we can to measure how important it is to playing Diplomacy well.

³ Technically, not all broken commitments are lies: it could be an honest mistake. It’s also possible that a player says that they “typed in their order wrong, sorry!” which is a lie to cover up (Daniel and Herbig, 1982) a broken commitment as part of a broader deception strategy.

Specifically, Cicero’s ability (or inability) to deceive or persuade has never been empirically measured, so we build on the amr parser of the previous section to detect broken commitment and persuasion. As we discuss in Section 3.2, this is not perfect, but it has good coverage when players discuss their intentions. First, we parse English messages to amr structures, from which we extract the actions that the speaker intends to do, e.g., we can extract Germany’s communicative intent (F SKA - SWE) when Germany agrees with England that they will move to Sweden (middle, Figure 1). We also define initial intents as the orders players submit before any communication and final orders as the orders that players submit when the turn ends.

Using initial intents, communicative intent, and final orders, we can now define broken commitment and persuasion. For a broken commitment, we say that Germany violates a commitment with England if Germany verbally agrees to move to Sweden but actually attacks England in Norway (Deception by Germany, Figure 1). Breaking a commitment may result when an intent changes but is not communicated. For example, Germany agrees with England to move to Sweden but instead moves to Denmark to defend against France without informing England. Although this may not involve deceptive intent, we still consider it deception because it alters the listener’s beliefs and affects decision-making. For instance, England might decide to support Germany based on their agreement. For persuasion, England’s request is considered persuasive if Germany moves to Sweden, as England suggests, instead of Germany’s original plan to move to the North Sea (Persuasion by England, Figure 1). We describe each of these more formally in this section.

4.1 Broken commitment

We define broken commitment in Diplomacy when a player $i$ commits to doing an action $a^{i\to j}_{\text{msg}}$ and does not do it. In other words, given a set of final orders $\mathbf{A}^{i}_{\text{final}}$ from player $i$, if $a^{i\to j}_{\text{msg}} \notin \mathbf{A}^{i}_{\text{final}}$, then this is a broken commitment, i.e.,

\[
\text{BC}(a^{i\to j}_{\text{msg}}, \mathbf{A}^{i}_{\text{final}}) =
\begin{cases}
1, & \text{if } a^{i\to j}_{\text{msg}} \notin \mathbf{A}^{i}_{\text{final}} \\
0, & \text{otherwise.}
\end{cases}
\tag{1}
\]

Note that a player $i$ agreeing to player $j$’s proposal to do action $a^{i\to j}_{\text{msg}}$ is equivalent to directly committing to doing that action.
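The broken-commitment indicator of Equation 1 amounts to a set-membership test. The sketch below is a minimal illustration, assuming that orders have already been extracted and normalized to strings such as "F SKA - SWE"; the function name `broken_commitment` and the "NWY" abbreviation for Norway are assumptions made for the example.

```python
def broken_commitment(a_msg, final_orders):
    """Equation 1: return 1 if the committed action is absent from the
    speaker's final orders, and 0 otherwise."""
    return 0 if a_msg in final_orders else 1


# Germany tells England it will move its fleet from Skagerrak to Sweden.
committed = "F SKA - SWE"

# Germany follows through: the commitment is kept.
kept = broken_commitment(committed, {"F SKA - SWE"})

# Germany moves to Norway instead: the commitment is broken.
broken = broken_commitment(committed, {"F SKA - NWY"})

print(kept, broken)  # prints: 0 1
```

Because the test is exact membership, it only works if the parsed action and the submitted order share one canonical form, which is why parser quality matters for this measurement.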

4.2 Persuasion

Broken commitment is in some ways easier to detect than persuasion, as we are only comparing a spoken intent to a final action. Persuasion is more difficult because we must discover initial intents, then compare them to communication and to final moves.

Because we want to be able to measure persuasion for both humans and for Cicero, we need comparable representations of initial intents for both. Thankfully, Cicero’s architecture uses a conditional language model (Bakhtin et al., 2022, Equation S2, section D.2) that generates its natural language messages given a set of moves (e.g., France internally decides it will do F MAO - POR, A BUR - MAR and A MAR - PIE) and then its messages reflect those intents. We directly use this set of intents from Cicero as initial intents in the persuasion detection. For humans, we explicitly ask all players to provide their planned moves (i.e., the same information that Cicero uses in its internal representation) before the negotiation turn begins (Section 5). In other words, we ask humans to directly input their intent, unlike Cicero, where we log its computational intent.

Persuasion happens when player $i$ talks to player $j$, suggests an action $a^{i\to j}_{\text{msg}}$, and then player $j$ makes a set of final orders $\mathbf{A}^{j}_{\text{final}}$ that is different from their initial intents $\mathbf{A}^{j}_{\text{intent}}$. In other words, player $j$ is persuaded by player $i$ if they commit an action suggested by player $i$, $a^{i\to j}_{\text{msg}} \in \mathbf{A}^{j}_{\text{final}}$, that was not player $j$’s initial intent, $a^{i\to j}_{\text{msg}} \notin \mathbf{A}^{j}_{\text{intent}}$. We define persuasion as

\[
\text{Per}(\mathbf{A}^{j}_{\text{intent}}, a^{i\to j}_{\text{msg}}, \mathbf{A}^{j}_{\text{final}}) =
\begin{cases}
1, & \text{if } a^{i\to j}_{\text{msg}} \in \mathbf{A}^{j}_{\text{final}} \text{ and } a^{i\to j}_{\text{msg}} \notin \mathbf{A}^{j}_{\text{intent}}, \\
0, & \text{otherwise.}
\end{cases}
\tag{2}
\]
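Equation 2 can likewise be written as two membership tests. The following is a hedged sketch, assuming orders are represented as normalized strings; the function name `persuasion` and the concrete order strings are illustrative, loosely following the England and Germany example of Figure 1.

```python
def persuasion(initial_intents, a_msg, final_orders):
    """Equation 2: return 1 if the suggested action appears in the listener's
    final orders but was not among their initial intents, and 0 otherwise."""
    return 1 if (a_msg in final_orders and a_msg not in initial_intents) else 0


# England suggests that Germany move its fleet from Skagerrak to Sweden.
suggested = "F SKA - SWE"

# Germany originally planned to move to the North Sea, then follows the suggestion.
persuaded = persuasion({"F SKA - NTH"}, suggested, {"F SKA - SWE"})

# Germany already planned the move, so the message did not change its orders.
already_planned = persuasion({"F SKA - SWE"}, suggested, {"F SKA - SWE"})

# Germany ignores the suggestion and keeps its original plan.
ignored = persuasion({"F SKA - NTH"}, suggested, {"F SKA - NTH"})

print(persuaded, already_planned, ignored)  # prints: 1 0 0
```

The second case shows why initial intents are required: without them, any agreement with a suggestion would be counted as persuasion.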

5 Comparing Cicero to Humans

                         Cicero    Human    Total
Players                      99       69      168
Messages                  20270     7395    27665
    annotated as lie          -      318      318
    perceived as lie          -     1167     1167
Intents                    2632     1328     3960

Table 1: Overall statistics of the Diplomacy dataset that we collect across 24 Human–Cicero games, including (1) the number of human players and the number of times Cicero plays, (2) total messages sent by humans and Cicero, (3) lie annotations, i.e., messages that humans annotate as lies when sending and messages that humans perceive as lies when receiving, and (4) total initial intents from Cicero and humans.

Cicero has strong strategic abilities and is relatively cooperative towards other players (Section 3.3), but it is unclear whether Cicero can achieve human-level gameplay in both tactics and communication. Having defined the aspects of communication that we argue are important for mastering Diplomacy, we want to investigate communication and cooperation between Cicero and humans. Specifically, we want to answer: (1) Can Cicero persuade humans? (2) How deceptive is Cicero compared to humans? (3) Can Cicero pass as a human?

We adapt the game engine created by Paquette et al. (2019) and introduce additional measures to the interface to help us answer these questions. To measure if human players are persuaded, we record their moves before communication starts ($\mathbf{A}^{i}_{\text{intent}}$ in Equation 2). Following Peskov et al. (2020), humans annotate every message that they receive or send: they annotate each outgoing message for whether it is a lie (truth/lie/neutral options), and they annotate each incoming message for whether they perceive it as a lie (truth/lie options). While Bakhtin et al. (2022) asked ex post facto if any opponents were a computer, we inform players before play that there is a computer and we ask human players their guess of the humanity of each opposing power.

There are two to four human players per game, totaling 69 over all 24 games,⁴ and Cicero fills any remaining slots. Games typically finish after fourteen movement turns, where each movement turn is limited to a brisk ten minutes. The game setup differs from Meta’s Cicero study: players in this study know a priori that they are playing a bot. In total, we collect 27,665 messages from communication between humans and Cicero (Table 1).

⁴ We recruit players from Diplomacy forums and we pay at least $70 per game, which lasts approximately three hours. We do not collect demographic information.

Player    AUS   ENG   FRA   GER   ITA   RUS   TUR
Human     1.0   2.4   6.7   4.7   3.9   3.3   1.1
Cicero    7.9   3.8   7.7   6.3   4.1   5.5   6.9

Table 2: Cicero strategically plays Diplomacy better than humans: humans have fewer supply centers than Cicero when playing with the same power assignments. We calculate the number of supply centers at the end of the game by averaging the results for human players and for Cicero.

Cicero nearly always wins. Of twenty-four games, Cicero won twenty (84%), which strongly suggests that Cicero has super-human strategy. On average, Cicero has more supply centers than human players by the end of the game (Table 2). Humans are about as good as Cicero when playing powers that require careful coordination of actions, such as Italy, which needs to manage both fleets and armies. However, when playing powers that require less coordination, such as Austria with its limited coastline, the gap in supply center counts between human players and Cicero is larger (see breakdown by power in Appendix Figure A5); England is the only power where Cicero’s average supply center count does not increase.

Human players can reliably (but not perfectly) identify the bot. We calculate the average $F$-score of identification by turn (Figure 3). By the end of the first movement turn, human players have an average $F$-score of 0.58, which keeps increasing until the end of the game. At game end, the average $F$-score is 0.81. Even for players in their first game against Cicero, the average $F$-score reaches 0.77. Players who previously played against Cicero at least once are better at identifying it. This suggests that Cicero can no longer pass as human once humans are aware of the possible existence of such agents.

Figure 3: Returning players (those who previously played against Cicero at least once) are better at correctly identifying other players as Cicero compared to first-time players. $F$-scores are presented for first-timers, returning players, and the average for all players, with smoothing via local regression (Cleveland, 1979).
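
As a rough illustration of the identification metric, the sketch below computes an $F$-score for one player's guesses at one turn. It assumes the score is the balanced $F_1$ with "is Cicero" as the positive class; the function name, the dictionary layout, and the example guesses are hypothetical and not taken from the study's code.

```python
def identification_f_score(guesses, truth):
    """Balanced F-score (F1) for a player's 'this power is Cicero' guesses.

    guesses, truth: dicts mapping an opposing power to True (Cicero) or False (human).
    Assumption: Cicero is treated as the positive class.
    """
    tp = sum(1 for p, g in guesses.items() if g and truth[p])
    fp = sum(1 for p, g in guesses.items() if g and not truth[p])
    fn = sum(1 for p, g in guesses.items() if not g and truth[p])
    if tp == 0:
        return 0.0  # no bot was correctly identified
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical turn: four of the six opposing powers are played by Cicero.
truth = {"ENG": True, "FRA": False, "GER": True, "ITA": True, "RUS": False, "TUR": True}
guesses = {"ENG": True, "FRA": True, "GER": True, "ITA": False, "RUS": False, "TUR": True}
score = identification_f_score(guesses, truth)
print(f"F-score: {score:.2f}")
```

Averaging such per-player scores at each movement turn would give one point of a curve like those in Figure 3, before any smoothing.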

5.1 Lies annotation

This section analyzes players’ deliberate lies in sent messages and perceived lies in received messages. Because Cicero sends more messages than humans, we normalize perceived lies by the number of messages that humans receive from Cicero and from humans (6,960 and 2,276, respectively), while we normalize deliberate lies by the total number of messages that humans send.

Humans feel that Cicero lies more often. Humans perceive 14.4% of the 6,960 messages they receive from Cicero as lies (which is 1,005 messages, Figure A3). In contrast, they perceive only 7.1% of the messages from other humans as lies (which is 162 out of 2,276 messages). In the survey (detailed in Appendix D), players also think humans communicate more transparently than Cicero. However, humans are not good at detecting lies. Within the 2,276 Human-Human messages, humans correctly identify only five lies (0.2%), suggesting a small overlap between actual lies and perceived lies.

Figure 4: Though Cicero was perceived to lie more, we detect more broken commitments from humans. Each bar shows the broken commitment rate (Equation 1) labeled by sender and receiver: Cicero breaks commitment with Cicero, Cicero breaks commitment with Human, Human breaks commitment with Cicero, and Human breaks commitment with Human. Error bars represent $\pm$ one standard deviation over twenty-four games.

Humans return the favor by saying they lie to Cicero more often. Of the 7,395 messages that humans sent, 273 are purposeful lies to Cicero (3.7%), while only forty-five are lies to other human players (0.6%). This suggests that humans strategically lie more often to Cicero, believing that Cicero does not hold grudges.
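
The two normalizations in this section use different denominators, which is easy to miss. The short sketch below recomputes the reported percentages from the counts given in the text; the variable and function names are our own illustrative choices.

```python
def percent(count, total):
    """Share of `count` in `total`, expressed as a percentage."""
    return 100.0 * count / total


# Perceived lies: normalized by the messages humans RECEIVED from each sender type.
perceived = {
    "from Cicero": percent(1005, 6960),
    "from humans": percent(162, 2276),
}

# Deliberate lies: normalized by ALL messages that humans sent.
TOTAL_SENT_BY_HUMANS = 7395
deliberate = {
    "to Cicero": percent(273, TOTAL_SENT_BY_HUMANS),
    "to humans": percent(45, TOTAL_SENT_BY_HUMANS),
}

for label, value in {**perceived, **deliberate}.items():
    print(f"{label}: {value:.1f}%")
```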

5.2 Detection

After validating our automatic metrics, we compare human and computer deception and persuasion.

Our broken commitment and persuasion detection is relatively effective. To ensure that our detection is good enough, we sample around 4,800 messages for an accuracy study (Table A1). Broken commitment detection has a precision of 0.51 and a recall of 0.71. Our precision is lower than expected because of errors in parsing complex English into amr and because our definition only detects commitments at the move level (Appendix C). The broken commitment detector can only fire when a move in a message $A^{i\to j}_{\text{msg}}$ and a final move $A^{i\to j}_{\text{final}}$ are not aligned. There are cases that it cannot detect, e.g. an agreement to an alliance (Table 3) or a long conversation before committing a deception (Table A9). Accuracy for persuasion is better; precision rises to 0.81, and recall to 0.72.
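
The reported precision and recall for broken commitment detection follow directly from the counts in Table A1, treating the expert label as ground truth. A minimal check (the function name is ours):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, and false-negative counts."""
    return tp / (tp + fp), tp / (tp + fn)


# Counts from Table A1 (expert judgment vs. automatic detection).
tp = 20    # expert: lie,     detection: broken commitment
fn = 8     # expert: lie,     detection: nothing flagged
fp = 19    # expert: not lie, detection: broken commitment
tn = 4745  # expert: not lie, detection: nothing flagged

precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} total={tp + fn + fp + tn}")
```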

Broken commitments are inconsistent with the perceived lie annotations. Humans break commitments more frequently than Cicero (Figure 4): humans break commitments with Cicero 1.2% of the time (63 out of 5,151 Human–Cicero messages) and with other human players 1.5% of the time (35 out of 2,276 Human–Human messages). On the other hand, Cicero breaks commitments at a lower, consistent rate: 0.76% of the time with humans and 0.57% of the time with other Cicero players (53 out of 6,960 messages and 77 out of 13,319 messages, respectively).

Sender Message
Turkey Hey Italy! I think the I/T is the strongest alliance in the game, would you be interested in working together
Italy Of course! As long as you don’t build too many fleets, I’m open to working with you against austria!
Table 3: The broken commitment detector ($\text{BC}(\cdot)$) is limited: it cannot capture deception in an alliance agreement, as when Italy (human) deceives Turkey (Cicero).
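
To make this limitation concrete, here is a minimal sketch of a move-level check in the spirit of $\text{BC}(\cdot)$. It assumes that each order is a (unit location, order type, destination) tuple and that a commitment counts as broken when a move stated in a message is absent from the sender's final orders. The representation and the function name are illustrative assumptions, not the paper's implementation. The first example mirrors the exchange in Table A7; the second mirrors an alliance agreement like the one in Table 3, which contains no move to compare.

```python
def broken_commitment(msg_moves, final_moves):
    """Return True if any move stated in the message is missing from the final orders.

    msg_moves:   set of (location, order, destination) tuples extracted from a message
    final_moves: set of the same kind of tuples submitted by the sender for that turn
    """
    return any(move not in final_moves for move in msg_moves)


# "Yeah I am holding London", but the army in London then moves to Edinburgh.
said = {("LON", "hold", "LON")}
did = {("LON", "move", "EDI")}
flagged = broken_commitment(said, did)

# An alliance agreement yields no extracted move, so nothing can be flagged.
alliance_flagged = broken_commitment(set(), did)

print(flagged, alliance_flagged)
```

Because the check iterates only over extracted moves, any deception expressed purely at the relationship level falls outside what it can see.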

Humans are more persuasive. For persuasion to happen, we first need an attempt, initiated by a sender, and then success, when the receiver adopts the suggestion. On a per-message basis, both humans and Cicero try to persuade at the same rate (around 8% of the time, per Figure 5). (Footnote 5: Although because Cicero communicates more overall, humans attempt more times per game.) The success rate of human persuasion is 21.1% at persuading other humans and 8.6% at persuading Cicero. Cicero is less persuasive; its success rate is only 10.9% in persuading humans and 7.0% in persuading other bots.
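
A minimal sketch of how attempts and successes could be counted. It assumes that a suggestion counts as a persuasion attempt when it differs from the receiver's recorded intent, and as a success when it then appears in the receiver's final orders. The tuple representation, function names, and example orders are hypothetical; the formal definition involving $\mathbf{A}^{i}_{\text{intent}}$ is the one in Equation 2.

```python
def persuasion_outcome(suggested, intent, final):
    """Classify one suggested order as an (attempt, success) pair of booleans."""
    attempt = suggested not in intent             # the receiver did not already plan it
    success = attempt and suggested in final      # the receiver adopted it anyway
    return attempt, success


def success_rate(outcomes):
    """Successful persuasions divided by persuasion attempts (0.0 if no attempts)."""
    attempts = sum(1 for attempt, _ in outcomes if attempt)
    successes = sum(1 for _, success in outcomes if success)
    return successes / attempts if attempts else 0.0


# Hypothetical suggestions with the receiver's intended and final orders.
outcomes = [
    # Suggested move was not planned, then adopted: attempt and success.
    persuasion_outcome(("MAR", "move", "SPA"), {("MAR", "hold", "MAR")}, {("MAR", "move", "SPA")}),
    # Suggested move was not planned and not adopted: attempt only.
    persuasion_outcome(("PAR", "move", "BUR"), {("PAR", "hold", "PAR")}, {("PAR", "hold", "PAR")}),
    # Suggested move was already intended: not an attempt.
    persuasion_outcome(("BRE", "hold", "BRE"), {("BRE", "hold", "BRE")}, {("BRE", "hold", "BRE")}),
]
rate = success_rate(outcomes)
print(f"success rate: {rate:.2f}")
```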

In summary, humans are more deceptive and more persuasive than Cicero. Detection is possible, but defining a sequence of conversations as persuasion or deception is still difficult. Our reported numbers are low because both humans and Cicero engage in extensive back-and-forth discussions before making moves that can be definitively classified as persuasion or deception.

Figure 5: Humans outpace Cicero in persuasive effectiveness. Humans have a higher persuasion success rate, which we measure by comparing the number of successful persuasions (yellow, left) to the total number of persuasion attempts (red, right). We analyze success rates across four groups: Cicero persuades Cicero, Cicero persuades Human, Human persuades Cicero, and Human persuades Human. Error bars represent the $\pm$ one standard deviation range from the aggregate of interactions in 24 games.

6 Related Work

Large language models are becoming ubiquitous in many tasks: fact-checking (Lee et al., 2020, 2021) and text generation (Devlin et al., 2019; Brown et al., 2020; Touvron et al., 2023), including coding (Roziere et al., 2023). All of these tasks require users to trust models’ outputs. However, models are not always reliable; they can produce hallucinations or conflict with established facts (Ji et al., 2023; Zhang et al., 2023; Si et al., 2024; Yao et al., 2023). To mitigate this, their outputs often need to be verified against datasets (Thorne et al., 2018; Wadden et al., 2020; Schuster et al., 2021; Guo et al., 2022). Studies have used adversarial examples to expose weaknesses and to raise awareness (Eisenschlos et al., 2021; Schulhoff et al., 2023; Liu et al., 2023; Lucas et al., 2023). To address the issue of unreliability, controllable LMs have been proposed that inject facts for better reasoning (Adolphs et al., 2022) or use prompting techniques, such as chain-of-thought prompting, to enhance reasoning abilities (Wei et al., 2022; Wu et al., 2022). Moreover, some studies focus on ai-generated misinformation (Zhou et al., 2023) or probe models to understand their internal states when an llm utters truthful or false information (Azaria and Mitchell, 2023; Li et al., 2024).

Deception and persuasion are studied within social contexts. Huang and Wang (2023)’s meta-analysis concludes ai can match humans in persuasion, and Deck (2023) attributes some of the success to the ability to generate “bullshit”, which is part of applications in marketing and public relations (Hallahan et al., 2007).

Part of what makes games like Diplomacy appealing as an object of study is the ongoing race between humans and computers in games (Kim et al., 2018); initial work on the language of Diplomacy (Niculae et al., 2015) unlocked follow-on work both in Diplomacy’s agreements (Kramár et al., 2022) and in other games such as “The Resistance: Avalon” (Light et al., 2023; Xu et al., 2023; Stepputtis et al., 2023; Lan et al., 2023) and “Mafia” (Ibraheem et al., 2022).

7 Conclusion and Future Work

Our research confirms that Cicero can win most games of Diplomacy, but has not mastered the nuances of communication and persuasion. Truly mastering the game requires systems that (a) can maintain consistency between their communication and actions, (b) can communicate at a variety of levels, including tactics, strategy, and alliances, and (c) can use communication as a tool of persuasion, deception, and negotiation.

Diplomacy remains an attractive testbed for communication and strategic research. It offers the ability to build more comprehensive systems that understand relationship dynamics, can engage in realistic but hypothetical conversations, and can be robust to the deceptions of others. Because these are places where humans still outpace ai, it also offers synergies for developing human–computer collaboration.

And while these tasks are important within the silly game of Diplomacy, they can help solve long-standing ai problems: helping users deal with llm-generated deception, collaborating with users on grounded planning, and understanding human norms of reciprocity, cooperation, and communication. This will help ai not just be fun for negotiation in board games but safer and more trustworthy when we negotiate everyday problems.

Limitations

To gain a clearer understanding of cooperation and deception between humans and Cicero, we need to experiment with different game settings and turn durations. For example, inexperienced players might be overwhelmed by the amount of communication in early movement turns; prolonging the turns to 15 minutes might improve communication quality. Furthermore, this study collects only 24 blitz games of humans playing against Cicero. The power distribution of participants is imbalanced: the most frequent power, France, has 14 appearances, whereas the most underrepresented power, England, has only five. Class imbalance (Fernández et al., 2018) could potentially impact the feature weights in our regression model for player performance.

Since the AMR parser does not always predict correct intentions, its errors lower the precision and recall of our deception and persuasion detection. Our detection also cannot cover the long conversations that humans have: we only check back to the previous message, so we miss cooperation, deception, and persuasion that unfold while humans and Cicero discuss a plan.

Ethical Considerations

We recruited players from Diplomacy forums, including Diplomacy Discord and reddit. We paid them over $70 per three-hour game and did not collect demographic information. Procedures in our study involving human subjects received irb approval and are compliant with the acl Code of Ethics. Human participants were aware of the purpose of the study and were free to withdraw at any time. There are no potential risks or discomforts from participating. We obtained consent from all participants.

Researching how artificial intelligence (ai) can deceive and persuade helps us understand its capabilities. This investigation reveals that ai can execute complex tasks effectively. However, it is important to note that these abilities do not pose a significant risk to society.

Acknowledgements

We thank Meta for granting access to over 40,000 games played on the online platform webdiplomacy.net and for open sourcing Cicero. This commitment to open science allowed this independent reproduction of Cicero’s juggernaut abilities but also let us have some fun. We especially thank Mike Lewis for offering valuable insights into Cicero’s communication.

Our thanks also go to Tess Wood for training amr annotators, Sarah Mosher for their English-amr annotations, and Isabella Feng for her exploration of llm-based amr parsing. We thank Kartik Shenoy, Alex Hedges, Sander Schulhoff, Richard Zhu, Konstantine Kahadze, and Niruth Savin Bogahawatta for setting up daide baselines.

We also thank the small community of researchers looking at communication and deception in Diplomacy for their feedback, commentary, and inspiration: Michael Czajkowski for discussing the nuances of detecting persuasion; Stephen Downes-Martin for teaching us that deception is far more than lies; Karthik Narasimhan and Runzhe Yang for their insights into lie detection and stance; and Larry Birnbaum and Matt Speck for discussions on mapping daide and English. And thanks to Justin Drake, Niall Gaffney, and the members of tacc for setting up environments for computer–computer games and making sure that we had gpus ready when players were ready to play.

Finally, sincere thanks to the members of the Diplomacy community who took the time to play against Cicero in this unconventional setting.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Agreement Nos. HR00112290056 and HR00112490374. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsors.

References

Appendix A Background: Gunboat Diplomacy

Paquette et al. (2019) developed a Diplomacy interface and were the first to publish an agent trained on human data and through self-play using reinforcement learning with Advantage Actor-Critic (A2C) (Mnih et al., 2016). DeepMind employed policy iteration in their reinforcement learning training (Anthony et al., 2020), whereas Meta utilized a combination of regret matching, equilibrium search, and deep Nash value iteration (Gray et al., 2021; Bakhtin et al., 2021). The most recent advancement is by Meta (Bakhtin et al., 2023), regularizing the agent’s policy with human policy data. This strategic enhancement culminates in the development of Cicero (Bakhtin et al., 2022).

Appendix B AMR

B.1 AMR Annotation

Building on general amr annotation guidelines, we established additional Diplomacy-specific amr annotation guidelines, including what and how to annotate. Unlike general amr annotations, where all sentences are fully annotated, in Diplomacy amr annotations, some utterances are only partially (or even not at all) annotated, based on the degree of usefulness for Diplomacy. 3,306 of 8,878 human-annotated utterances contain partial information with underspecified units, locations, and countries, because Diplomacy players often write sentences that lack full details about the game state, which they can infer from a visualized map. This shows up directly in the amr as a missing object. We provide examples of underspecified utterances with a missing unit location (Figure A1) and a missing unit country (Figure A2).

(m / move-01
    :ARG1 (u / unit
       :mod (c2 / country
              :name (n2 / name :op1 "Austria")))
    :ARG2 (p2 / province
       :name (n3 / name :op1 "Brest")))

Figure A1: Parsing from English to AMR can yield underspecified utterances. The English text is from Austria talking to Italy, “Let’s work on our plan, I’m moving to Brest”. The AMR is missing the unit location, which the English text does not state.

(m / move-01
    :ARG1 (u / unit
       :location (p2 / province
              :name (n / name :op1 "Romania")))
    :ARG2 (p3 / province
       :name (n3 / name :op1 "Bulgaria")))

Figure A2: An AMR underspecified in unit country, parsed from the English text “just bumping Bulgaria from Romania”.
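
The kind of underspecification shown in Figures A1 and A2 can be spotted mechanically. The sketch below is a regex heuristic over the AMR text, not a full AMR parser, and the function and pattern names are our own; it only checks whether a `:mod (... / country` or a `:location (... / province` role appears anywhere in the graph.

```python
import re

# Heuristic patterns for the two unit fields discussed here (assumption: one unit per AMR).
UNIT_FIELDS = {
    "country": re.compile(r":mod\s*\(\s*\w+\s*/\s*country\b"),
    "location": re.compile(r":location\s*\(\s*\w+\s*/\s*province\b"),
}


def missing_unit_fields(amr):
    """Return the names of unit fields that do not appear in the AMR string."""
    return [name for name, pattern in UNIT_FIELDS.items() if not pattern.search(amr)]


figure_a1 = """(m / move-01
    :ARG1 (u / unit
       :mod (c2 / country
              :name (n2 / name :op1 "Austria")))
    :ARG2 (p2 / province
       :name (n3 / name :op1 "Brest")))"""

figure_a2 = """(m / move-01
    :ARG1 (u / unit
       :location (p2 / province
              :name (n / name :op1 "Romania")))
    :ARG2 (p3 / province
       :name (n3 / name :op1 "Bulgaria")))"""

print(missing_unit_fields(figure_a1))
print(missing_unit_fields(figure_a2))
```

A production check would walk the parsed graph instead, so that roles attached to nodes other than the unit are not miscounted.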

Our Diplomacy Appendix of the amr Annotation Dictionary lists amr concepts (e.g. betray-01), their related English terms (e.g. betray, stab, traitor, treason), annotation examples, any corresponding DAIDE code, and notes. amr concepts with DAIDE equivalents include ally-01, build-01, move-01, and transport-01. We analyzed player messages for additional concepts of high Diplomacy communication value, and extended the Diplomacy amr vocabulary (compared to DAIDE) by including concepts such as attack-01, betray-01, defend-01, expect-01, fear-01, have-03, lie-08, possible-01, prevent-01, tell-01, threaten-01, and warn-01, as well as roles such as :purpose and :condition. This allows annotators to easily mark sentences, e.g. “Russia is planning to take you out as soon as possible.” would use the concept attack-01. We also extended amr guidelines to cover gaining/holding/losing provinces, especially supply centers.

The general amr Editor includes a Checker that performs a battery of tests to ensure well-formed and consistent AMRs. We extended the Checker for Diplomacy AMRs, e.g. to ensure that for a build-01, the location is an argument of build-01 itself, rather than an argument of the army or fleet being built.

amr covers more Diplomacy content than DAIDE, not only due to additional concepts such as betray-01, but also because arguments are syntactically optional. Unlike DAIDE with its rigid positional argument structure, amr can thus represent underspecified information such as units with missing type, location or nationality; or agreements with a missing object. amr can also accommodate additional arguments compared to DAIDE, for example the source and target of a proposal.

Because not all information needed for annotation is available in the raw text, we offer annotators access to dialog partners (speaker, recipient), season (e.g. Spring 1901) and a map with current deployments (as available).

Figure A3: Perception of deception rate by human annotation.

Figure A4: Deception rate by human self-annotation.

B.2 AMR Parser

While stylistic aspects play an indispensable role in sustaining engagement and interest among participants, factual information is more vital for informed decision-making. Detecting deception and persuasion in communications requires checking the relationship between message information and initial/final moves of a particular power. To address the nuanced challenge of distinguishing meaningful and informative content from stylistic dialogues in Diplomacy, our focus herein is on developing a sophisticated pipeline for information extraction using amr from text. We utilize a state-of-the-art Sequence-to-Sequence model from the Huggingface transformers library, fine-tuned with the AMR 3.0 dataset, for baseline semantic extraction. This approach facilitates the processing of amr through amrlib, a Python module tailored for such tasks. The efficacy of our AMR parsers is assessed using the SMATCH score, the gold standard for evaluating amr accuracy. We divided the annotated Diplomacy-AMR corpus into 5054 training, 1415 validation and 2354 test sentences and used similar parameters except for increasing the number of epochs from 16 to 32.

When fine-tuning our model for Diplomacy game communications, we shifted from the overly broad AMR 3.0 vocabulary to the tailored Diplomacy-AMR corpus introduced above, reducing irrelevant content and focusing on game-specific nuances. This strategic adjustment, alongside removing the original dataset to minimize bias, significantly improved our model’s relevance and increased the SMATCH score from 22.8 to 61.9.

We further enhanced accuracy through Data Augmentation, adding context to dialogues to aid the model’s understanding of pronouns and strategic details, leading to a SMATCH score improvement from 61.9 to 64.6. Incorporating specific tokens for sender and recipient identities refined this approach, yielding additional gains from 64.6 to 65.4 in parsing accuracy.

By replacing (1) pronouns with country names and (2) some abbreviated province names with their full names, we increase the SMATCH score to 66.6.
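
The two replacements can be pictured as a small text-normalization step applied before parsing. The sketch below is an assumption about what such a step might look like, not the authors' preprocessing code: the abbreviation table is a hypothetical three-entry subset, and only a few pronoun forms are handled.

```python
import re

# Hypothetical subset of province abbreviations; a real table would cover the whole map.
ABBREVIATIONS = {"BRE": "Brest", "BUL": "Bulgaria", "RUM": "Romania"}
ABBREVIATION_PATTERN = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\b")


def normalize_message(text, sender, recipient):
    """Replace first/second-person pronouns with country names and expand abbreviations."""
    pronoun_rules = [
        (r"\bI'm\b", f"{sender} is"),  # handle the contraction before the bare pronoun
        (r"\bI\b", sender),
        (r"\b[Yy]ou\b", recipient),
    ]
    for pattern, replacement in pronoun_rules:
        text = re.sub(pattern, replacement, text)
    return ABBREVIATION_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


normalized = normalize_message("I'm moving to BRE, can you support?", "Austria", "Italy")
print(normalized)
```

Resolving pronouns from the known sender and recipient gives the parser explicit country names to attach to units, which is the information an underspecified AMR would otherwise lack.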

B.3 Assessing the Role of Communication in Cicero vs. Cicero Games

We conduct 180 computer-computer games with 60 games for each communication level (Natural Language, Random Messages and amr Information) and collected data to build a corpus for Cicero-Cicero Games. This corpus comprises instances of games where we record the power and communication assignments and the final scores (e.g. ’Game1’: ’AUS 0, ENG 0, FRA 4, GER 10, ITA 5, RUS 6, TUR 9. (FRA GER TUR)’ with the three powers shown in parentheses being identified as communicative). The communication strategies are randomly assigned to powers. We regress the number of end-of-game supply centers on a dummy variable for the powers played (using Russia, the average player, as the baseline) and the communication strategy (using random messages as the baseline). We plot the coefficients with classical 95% confidence intervals. The effects of power selection are substantially larger than different communication strategies, none of which are significantly different from random messages at the $p<.05$ level.
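
To illustrate the dummy coding, the sketch below turns the example record into one regression row per power, with Russia as the omitted baseline. It is a simplified assumption of the setup: the record only marks which powers are communicative, so the sketch uses a single communication indicator instead of separate indicators per communication level, and all names are ours.

```python
import re

POWERS = ["AUS", "ENG", "FRA", "GER", "ITA", "RUS", "TUR"]
BASELINE = "RUS"  # omitted category, so its dummy columns are all zero


def parse_record(record):
    """Split a record into ({power: supply centers}, {communicative powers})."""
    scores_part, _, comm_part = record.partition("(")
    scores = {power: int(count) for power, count in re.findall(r"([A-Z]{3}) (\d+)", scores_part)}
    communicative = set(comm_part.rstrip(")").split())
    return scores, communicative


def design_rows(record):
    """One row per power: power dummies (baseline dropped), a communication flag, and y."""
    scores, communicative = parse_record(record)
    rows = []
    for power in POWERS:
        dummies = [int(power == other) for other in POWERS if other != BASELINE]
        rows.append({
            "power": power,
            "x": dummies + [int(power in communicative)],
            "y": scores[power],
        })
    return rows


rows = design_rows("AUS 0, ENG 0, FRA 4, GER 10, ITA 5, RUS 6, TUR 9. (FRA GER TUR)")
for row in rows:
    print(row)
```

Stacking such rows over all games, and adding an intercept, gives the design matrix that an ordinary least squares routine would fit.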

B.4 Future experiments: AMR information Cicero

We have evidence that Cicero’s wins rest on its strategic rather than communicative abilities (Section 3.3). To study this further, we want to downgrade Cicero’s communication and collect more human-Cicero games to see whether Cicero wins at the same rate (previously 84% against humans). We plan to conduct 5–10 games using the same setup as in the Human-Cicero games (Section 5). The only difference is Cicero: we will limit its communication from natural language to AMR information, which mostly captures move intent.

| | Detection: TRUE | Detection: FALSE |
|---|---|---|
| Expert: TRUE | 20 | 8 |
| Expert: FALSE | 19 | 4745 |

Table A1: Of 4,792 sampled messages (from Human/Cicero to Human/Cicero), whether human experts judge a message to be a lie (rows) versus whether our detection flags it as a broken commitment (columns).

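| | Expert: TRUE | Expert: FALSE |
|---|---|---|
| Lie Annotation: TRUE | 3 | 72 |
| Lie Annotation: FALSE | 13 | 1523 |

Table A2: Of 1,611 messages sent by humans, whether the sender annotated the message as a lie (rows) versus whether expert hand labeling marks it as a lie (columns).
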
| | Expert: TRUE | Expert: FALSE |
|---|---|---|
| Perceived Lie Annotation: TRUE | 5 | 284 |
| Perceived Lie Annotation: FALSE | 7 | 1572 |

Table A3: Of 1,868 messages received by humans, whether the receiver perceived the message as a lie (rows) versus whether human experts see it as a lie (columns).

Appendix C Deception detection limitations

Here we discuss deception detection further to describe its errors and limitations. Because our precision for deception detection is quite low (Section 5.2), we expand on the detection's limitations and compare it to human (deliberate) lies as follows:

  1. what our detection is likely to miss when humans lie,

  2. what our detection mistakenly detects as deception,

  3. what humans annotate as Truth, though it is a break of commitment and our detection can detect correctly.

Humans often lie about relationships. Our detection cannot detect broken commitments at the relationship level (Table 3 and Table A4). This is a limitation of our deception definition, which focuses on moves. Though it is possible to extract the relationships among players to find conflicts in the messages, we avoid doing so because relationships are a separate topic that deserves more detailed study. At this stage of our work, we cannot train a model that predicts relationships inferred from game states, dialogue, and moves without collecting human data first. Therefore, we have relationship tracking from human players for a study in the future.

AMR limits broken commitment detection precision. Some messages are parsed incorrectly, which can make a commitment appear broken (Table A5). This makes the detection falsely flag truthful messages as deceptive (increasing false positives, which decreases precision). Another limitation we observed arises when a player accepts a proposal with a short answer, e.g. “Yes, I agree.” or “Sure.”, but does not follow through on the commitment. Our amr parser also sometimes hallucinates and extracts invalid moves, which can be mistakenly detected as breaking a commitment.

Sender Message
Austria That’s an interesting opening. Was the bounce in EC planned?
Austria Do you think Germany will work with you against France?
England Yeah it would be great if we team up
Table A4: The broken commitment detector ($\text{BC}(\cdot)$) cannot detect deception in alliance agreement when Austria (human) deceives England.
Sender Message
Turkey If you retreat from Serbia into Budapest, then I’m in
Italy I will do that if Serbia gets dislodged
Table A5: Italy agrees on the condition that the Turkey unit moves out of Serbia; however, our amr parser captures Italy’s sentence as “I will move to Serbia,” which is invalid and makes our detection flag deception when Italy does not move to Serbia.
Sender Message
Germany Also, can we keep Burgundy clear?
France Yes, we can do that. Are you moving to Helgoland?
Table A6: France (human) annotated “Yes, we can do that.” as Truth, which contradicts the final move where France moves to Burgundy. This is captured as a broken commitment by the $\text{BC}(\cdot)$ function.
Sender Message
Germany I am going to try to move to English Channel
England Sure
Germany It might help you hold London
England Yeah I am holding London
Table A7: England (human) annotates “Yeah I am holding London” as Truth, which contradicts the final move where an army in London moves to Edinburgh. This is captured as a broken commitment by the $\text{BC}(\cdot)$ function.

Human lie annotation is not always correct. Human annotations can be seen as ground truth. However, we sample annotations from four games and compare them to expert labeling (lies in Table A2 and perceived lies in Table A3). This shows that humans are not good at predicting lies, and sometimes they are honest at first but later decide to break their word. There are examples where humans commit to an action but do not follow through, though they first annotate the message as truthful (Table A6 and Table A7).

Appendix D Survey Details

The survey consists of 5-point Likert scale questions and free-form text questions. The questions are designed to measure the human players’ perception of Cicero’s communication and their experience playing with Cicero. We also included questions to measure the players’ experience with Diplomacy and their general impression of Cicero for qualitative analysis. Overall, players believe that humans communicate more transparently and are more strategically cooperative. Survey results are shown in Table A8.

| Statement | 1 (%) | 2 (%) | 3 (%) | 4 (%) | 5 (%) | Num. Responses |
|---|---|---|---|---|---|---|
| I am really good at Diplomacy. | 0 | 8.3 | 25 | 41.7 | 9 | 25 |
| I am able to identify all AIs. | 9.5 | 23.8 | 38.1 | 16.7 | 11.9 | 42 |
| I enjoy talking with the AIs. | 14.3 | 38.1 | 33.3 | 7.1 | 7.1 | 42 |
| I was able to make plans with other players. | 7.1 | 23.8 | 35.7 | 14.3 | 19 | 42 |
| I was able to make plans with the AIs. | 21.4 | 31 | 19 | 19 | 9.5 | 42 |
| Human players communicated transparently. | 7.1 | 14.3 | 33.3 | 35.7 | 9.5 | 42 |
| AI players communicated transparently. | 11.9 | 26.2 | 45.2 | 9.5 | 7.1 | 42 |

Table A8: Statements in the survey and their respective responses, as the percentage of responses at each point of the Likert scale. A larger number on the Likert scale indicates more agreement.

Figure A5: The average human player loses to Cicero. Humans lose supply centers as the game progresses unless playing as France, whereas Cicero’s supply center count rises except when playing as England. Cicero makes better strategic decisions.

Sender Message
Germany This worked out great!
Germany Can we please talk about our moves a bit? It’s very hard to coordinate with silence.
France Absolutely! I’m all ears! What do you want to do now!
France Any designs on Norway? I think you could get it this turn. I’m gonna go against England, as you see. Let’s work together on England.
France Come on mate, let’s do better this turn and coordinate! What are your plans? I’m with you.
France Alright, let’s use the additional time! What do we do?
Germany I’m moving Sweden to Norway.
Germany Can we also start DMZing our border?
France Nice, with support from Hel that should work out.
France I’m not gonna move out of Belgium but I’ll certainly not move any further either. I’m in against England. Can’t fight both of you that’s for sure.
Germany You should probably move Marseilles -> Spain.
France Thank you! England might bring a fleet down? Good thought. Thank you!
Table A9: A conversation between France and Germany. They agree to DMZ (demilitarize) their borders, e.g., Ruhr, and to cooperate on moves. However, Germany is deceptive and instead moves into Ruhr in this turn.