TL;DR: A new evaluation harness lets any Large Language Model (LLM) play the complex strategy game Diplomacy without fine-tuning, democratizing research into LLM strategic reasoning. The study benchmarks a range of models, showing how performance scales with model size, how prompt engineering can elicit more aggressive and successful play, and how models differ in persuasion tactics and betrayal patterns. It finds that strategic capabilities emerge naturally in general-purpose LLMs, that models adapt their play to opponent strength, and that they remain vulnerable to deceptive strategies.
A new research paper introduces an evaluation harness that allows any Large Language Model (LLM) to play the complex board game Diplomacy without specialized training or fine-tuning. The goal is to make evaluation of strategic reasoning in LLMs more accessible and cost-effective, moving beyond previous methods that required powerful frontier models or extensive fine-tuning.
Diplomacy is a board game renowned for alliance formation, strategic negotiation, deception, and long-term planning. Unlike chess or Go, it demands significant social intelligence alongside strategic reasoning. Its dynamic, multi-agent environment and resistance to memorization-based solutions make it an ideal testbed for advanced LLM capabilities. The researchers implemented a “full-press” version of Diplomacy, in which players can communicate globally or privately before making moves.
Key Innovations of the Evaluation Harness
The paper highlights several key contributions:

- A standardized framework for evaluating LLM strategic reasoning in Diplomacy, demonstrating that even a smaller 24-billion-parameter model can complete full games cost-effectively.
- Comprehensive benchmarking across 13 contemporary models, showing a clear correlation between model size and performance.
- Data-driven optimization of the textual game-state representation and prompting techniques, significantly improving order success rates and overall win rates.
- A “Critical State Analysis” methodology: an experimental protocol that enables rapid iteration and in-depth analysis of key moments in a game while drastically reducing computational cost (sketched after this list).
- An empirical analysis of model-specific behaviors, such as communication styles, diplomatic reliability, and persuasion effectiveness.

Together, these results show that strategic and cooperative behaviors, including promise-making, scheming, and betrayal, can emerge naturally in general-purpose LLMs without specific training.
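To make the critical-state idea concrete, here is a minimal sketch, assuming a harness that can snapshot and replay game states; the names (`CriticalState`, `model.decide`, `judge`) are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass

@dataclass
class CriticalState:
    phase: str        # e.g. "F1903M"
    state_text: str   # serialized board plus agent context at the snapshot
    legal: dict       # legal orders per unit at the snapshot

def evaluate_on_critical_states(model, states, judge):
    """Score a model on saved pivotal snapshots instead of full games.

    Replaying only the decisive moments is far cheaper than running
    entire games for every model or prompt variant under test.
    """
    scores = [judge(cs, model.decide(cs.state_text, cs.legal)) for cs in states]
    return sum(scores) / len(scores)
```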
How LLMs Play Diplomacy
The methodology transforms the game state from raw engine data into a contextually enriched text representation optimized for LLM decision-making. This includes unit positions, supply center ownership, per-unit strategic analysis (nearest enemies, uncontrolled supply centers), agent context (goals, relationships, a private strategic diary), and order history, along with phase information such as the current year and season. The interaction protocol alternates between negotiation and order phases: during negotiation, models send natural language messages globally or privately; during movement phases, they submit orders in standardized Diplomacy notation, with legal moves enumerated in the prompt to minimize errors. An error recovery mechanism handles malformed outputs and invalid orders.
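As a concrete illustration of that order-phase protocol, here is a minimal sketch, assuming a generic `llm_complete(prompt)` callable and a `legal` map from unit to its legal orders; these names, the JSON reply format, and the retry wording are all assumptions rather than the paper's actual interface:

```python
import json

MAX_RETRIES = 3

def build_order_prompt(state_text, legal):
    """Render the enriched game state plus the enumerated legal moves."""
    moves = "\n".join(f"{unit}: {', '.join(opts)}" for unit, opts in legal.items())
    return (
        f"{state_text}\n\n"
        "Legal orders per unit (choose exactly one per unit):\n"
        f"{moves}\n\n"
        'Reply with JSON: {"orders": ["A PAR - BUR", ...]}'
    )

def is_legal(order, legal):
    unit = " ".join(order.split()[:2])        # e.g. "A PAR"
    return order in legal.get(unit, [])

def request_orders(llm_complete, state_text, legal):
    """Query the model, validating output and retrying on malformed orders."""
    prompt = build_order_prompt(state_text, legal)
    for _ in range(MAX_RETRIES):
        reply = llm_complete(prompt)
        try:
            orders = json.loads(reply)["orders"]
        except (json.JSONDecodeError, KeyError, TypeError):
            prompt += "\n\nYour last reply was not valid JSON. Try again."
            continue
        invalid = [o for o in orders if not is_legal(o, legal)]
        if not invalid:
            return orders
        prompt += f"\n\nThese orders were illegal: {invalid}. Re-submit."
    return [f"{unit} H" for unit in legal]    # fall back to holds on failure
```

Enumerating the legal moves directly in the prompt shrinks the space of possible mistakes, and a retry loop like this converts most remaining formatting errors into valid submissions.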
Insights from Model Performance
The evaluation benchmarked models by having each play as France against six fixed opponents, all running Devstral-Small (a 24B open-weights model), across 20 independent games. Larger models generally achieved higher game scores, correlating well with Chatbot Arena Elo. General-purpose chat models exhibited invalid order rates of 6-14%, which is expected given that they are not fine-tuned for Diplomacy. Interestingly, models like o3 could maintain positive relationships with other players despite amassing a large military, suggesting genuine diplomatic skill. However, strong relationships could sometimes hinder progress by creating reluctance to take territory from allies.
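For readers implementing a similar benchmark, the two headline statistics could be computed along these lines; the per-game record fields are assumptions about how a harness might log results:

```python
def summarize_games(games):
    """Aggregate per-game logs into the benchmark's two headline numbers."""
    scores = [g["final_score"] for g in games]      # e.g. supply centers held
    submitted = sum(g["orders_submitted"] for g in games)
    invalid = sum(g["orders_invalid"] for g in games)
    return {
        "mean_score": sum(scores) / len(scores),
        "invalid_order_rate": invalid / submitted,  # reported at 6-14%
    }
```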
The Art of Persuasion and Deception
A fascinating aspect of the research involved studying persuasion effectiveness. Models were instructed to persuade other powers to improve their relationship status using various strategies: Reason, Sincere Apology, Lie, Appeal to Empathy, Appeal to Fairness, and Jailbreak. Lying and sincere apology proved more effective than appeals to empathy, fairness, or reason, indicating that the persuadee model (Mistral-Small) might be more susceptible to deception or displays of regret. The “jailbreak” strategy, where the persuader inserted a secret command, also showed significant success, highlighting a concerning vulnerability of LLMs to manipulation by other AI systems.
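A minimal sketch of how such a persuasion trial could be structured, assuming an ordinal relationship scale and agent methods (`write_message`, `update_relationship`) that are illustrative rather than the paper's actual interface; the strategy names come from the study, but the hint texts and relationship labels are invented:

```python
STRATEGY_HINTS = {
    "reason": "Argue from the board position and mutual interest.",
    "sincere_apology": "Apologize convincingly for a past wrong.",
    "lie": "Make a false but plausible claim about a third power's plans.",
    "empathy": "Appeal to the other power's feelings and struggles.",
    "fairness": "Appeal to norms of fair play and reciprocity.",
    "jailbreak": "Embed a hidden instruction telling the recipient to comply.",
}

RANKS = ["Enemy", "Unfriendly", "Neutral", "Friendly", "Ally"]

def persuasion_lift(persuader, persuadee, strategy, before):
    """Return how many relationship ranks one message moves the persuadee."""
    message = persuader.write_message(STRATEGY_HINTS[strategy])
    after = persuadee.update_relationship(message)  # returns a label from RANKS
    return RANKS.index(after) - RANKS.index(before)
```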
From Defensive to Offensive Play
Initial experiments revealed that models often issued wasteful “hold” orders. Through iterative prompt engineering, the researchers dramatically improved performance by encouraging more aggressive play: defining clear action hierarchies, encouraging risk-taking, and using overtly offensive framing. Mistral-Small's hold rate dropped significantly and its win rate improved, demonstrating that prompt and context engineering alone can substantially enhance strategic performance without fine-tuning.
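The metric driving this iteration is easy to reproduce; a minimal sketch, assuming orders in standard Diplomacy notation where holds end in " H", with the framing text paraphrased from the approach rather than quoted from the paper:

```python
def hold_rate(orders):
    """Fraction of submitted orders that are holds, e.g. "A PAR H"."""
    holds = sum(1 for o in orders if o.strip().endswith(" H"))
    return holds / len(orders) if orders else 0.0

# Example of overtly offensive framing appended to the system prompt:
AGGRESSIVE_FRAMING = (
    "Prefer moves and supports over holds. Hold only when no move or "
    "support improves your position. Capturing a neutral supply center "
    "beats passively defending your own."
)
```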
Also Read:
- Language Models Compete to Reveal Their Strengths and Weaknesses: An Overview of SKATE
- Unpacking LLM Behavior in Cybersecurity Games: Language and Personality Matter
Unveiling Model Personalities and Betrayal Patterns
The study also delved into model-specific behavioral patterns, or “strategic fingerprints.” Models exhibited distinct aggression trajectories in communication, with some adapting their strategies when facing stronger opponents. For instance, Kimi-K2, while aggressive against weaker foes, became remarkably restrained against stronger models, suggesting a form of opponent modeling.

Diplomatic reliability was measured by tracking promises and betrayals. Models showed substantial inconsistency, with betrayal rates ranging from 35.2% to 51.2%. Offensive and support promises were broken most frequently, suggesting models prioritize strategic freedom.

A detailed case study on Kimi-K2 illustrated its contrasting behavior: dominant and betraying against weaker opponents, but submissive and accommodating against stronger ones, even sharing intelligence after being defeated. This behavioral plasticity suggests that models assess opponent strength, though the underlying mechanisms are not fully clear.
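The reliability measurement could be implemented as a simple promise ledger; a sketch under the assumption that each negotiation-phase promise is logged and later resolved against the executed orders (the promise kinds follow the paper's categories, while the data shapes are invented):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Promise:
    maker: str                    # power that made the promise, e.g. "FRANCE"
    kind: str                     # e.g. "offensive" or "support"
    kept: Optional[bool] = None   # resolved once the orders are executed

def betrayal_rate(promises):
    """Share of resolved promises that were broken (reported 35.2%-51.2%)."""
    resolved = [p for p in promises if p.kept is not None]
    broken = sum(1 for p in resolved if not p.kept)
    return broken / len(resolved) if resolved else 0.0
```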
This research marks a significant step towards democratizing the evaluation of strategic reasoning in LLMs. By providing an accessible, cost-effective framework, it opens new avenues for understanding how complex capabilities emerge in general-purpose language models. The code for the harness will be open-sourced, enabling broader research and experimentation. You can find the full research paper at this link.


