Unpacking AI Negotiation: How Language Models Reason, Perform, and Cost Across Cultures

TLDR: A new study evaluates how reasoning capabilities affect the negotiation performance and computational cost of large language models (LLMs) across English, German, and Italian. It finds that enabling reasoning significantly improves negotiation outcomes, but at a substantial computational cost. Commercial LLMs maintain language consistency in their internal reasoning, while open-weight models often switch to English. The research suggests that reasoning fosters genuine strategic adaptation rather than simple pattern matching, and it identifies key trade-offs between performance and cost.

Negotiation is a complex human skill, requiring strategic thinking, an understanding of others’ intentions, and a delicate balance between cooperation and competition. As large language models (LLMs) are increasingly deployed as autonomous agents in real-world scenarios, their ability to negotiate effectively becomes crucial. A recent study takes up this challenge, systematically evaluating how reasoning capabilities influence LLMs’ negotiation performance and associated costs across multiple languages.

The research, titled “The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models,” was conducted by Sherzod Hakimov, Roland Bernard, Tim Leiber, Karl Osswald, Kristina Richert, Ruilin Yang, Raffaella Bernardi, and David Schlangen. It addresses two significant gaps in previous research: the systematic investigation of reasoning’s impact on negotiation performance and computational cost, and the exploration of multilingual negotiation capabilities.

The Challenge of AI Negotiation

Previous studies have shown that LLMs often struggle with optimal play in negotiation, sometimes losing to weaker opponents or failing in cooperative tasks. They can exhibit deceptive tactics, express desperation, or even take economic risks. This highlights the need for a deeper understanding of how LLMs make strategic decisions in interactive, multi-turn scenarios.

Methodology: Three Dialogue Games

To thoroughly evaluate LLM negotiation abilities, the researchers implemented three distinct dialogue games in a self-play setup, where two instances of the same LLM played against each other:

Deal or No Deal (DoND): A multi-issue bargaining game testing preference expression, understanding, and compromise. Players negotiate over items with different private values.
Clean Up: A cooperative game focused on strategic development and object rearrangement on a grid, requiring spatial reasoning and coordinated actions.
Air Balloon Survival: An advanced game evaluating reasoning and interactive collaboration. Players must agree on items to discard from a sinking hot air balloon to reduce weight, maximizing combined utility based on hidden preferences. This game explicitly allowed for “strategic reasoning” traces to be generated by the models.

These games were conducted in English, German, and Italian, using both commercial models (GPT-5, GPT-5-mini, Claude-4) and open-weight models (Llama3.3-70B, Deepseek-R1-distilled-llama-70B, Nemotron-Nano-9B-v2, Qwen-3-80B, GPT-OSS-120B, Deepseek-v3.1).
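To make the Air Balloon Survival objective concrete, here is a minimal sketch of the underlying optimization, assuming each item has a weight and each player has a private utility for it. The item names, weights, utilities, weight limit, and the function name `best_keep_set` are all hypothetical and are not taken from the paper. In the actual game the players cannot see each other's preferences and must reach the agreement through dialogue; the brute-force search below only shows what an ideal joint outcome would look like.

```python
from itertools import combinations

def best_keep_set(weights, utils_a, utils_b, max_weight):
    """Brute-force the set of items to keep that maximizes combined
    utility while staying within the weight limit (hypothetical setup)."""
    items = list(weights)
    best, best_score = frozenset(), 0
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            # Skip any selection that would still be too heavy.
            if sum(weights[i] for i in combo) > max_weight:
                continue
            # Combined utility is the sum of both players' private values.
            score = sum(utils_a[i] + utils_b[i] for i in combo)
            if score > best_score:
                best, best_score = frozenset(combo), score
    return best, best_score

# Illustrative values only.
weights = {"water": 4, "rope": 2, "radio": 3, "food": 5}
utils_a = {"water": 9, "rope": 2, "radio": 6, "food": 7}
utils_b = {"water": 5, "rope": 4, "radio": 8, "food": 3}

keep, score = best_keep_set(weights, utils_a, utils_b, max_weight=9)
discard = set(weights) - keep
print(sorted(keep), sorted(discard), score)
```

With these made-up numbers the best joint choice keeps the water, rope, and radio and discards the food. A negotiated outcome can then be judged by how close its combined utility comes to this optimum.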

Key Findings: Reasoning’s Impact and Multilingual Nuances

The study yielded several critical insights:

1. Reasoning Significantly Boosts Performance, But at a Cost: Enabling reasoning (scaling test-time compute) dramatically improved negotiation outcomes across many models and languages. For instance, Qwen-3 saw a 56-point gain, and GPT-5’s performance improved by 31.4%. However, this came with a substantial computational cost. GPT-5’s cost increased by nearly 400% when reasoning was enabled, making it the most expensive model to run in reasoning mode. GPT-5-mini and GPT-OSS were identified as more cost-efficient options among commercial and open-weight models, respectively.

2. Multilingual Reasoning Distinction: A significant finding was the difference in language consistency. Open-weight models consistently switched to English for their internal reasoning steps, even when negotiating in German or Italian. This could impact the explainability of their reasoning processes. In contrast, leading commercial models like Claude-4 maintained language consistency between their internal reasoning and final output, thinking in the language of the task.

3. Strategic Adaptation vs. Surface-Level Pattern Matching: The research suggests that reasoning enables genuine strategic adaptation rather than mere pattern matching. Models with reasoning showed improved handling of complex rules, better value-based decisions, and enhanced collaborative outcomes. Analysis of “reasoning loops” (repeated actions or thoughts) showed that well-performing models rarely displayed such loops, indicating more goal-oriented planning. Role awareness, meaning an understanding of one’s own role as a player and of the existence of a counterpart, was also found to be a prerequisite for consistently high scores.

4. Performance Across Models: GPT-5 emerged as the top performer, closely followed by GPT-5-mini and Claude-4. Qwen-3 showed the most significant performance jump when reasoning was enabled.
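The “reasoning loops” mentioned in the third finding can be illustrated with a simple repetition check. This is a minimal sketch, assuming a trace is available as an ordered list of action or thought strings. The article does not describe how the study detected loops, so the n-gram counting approach, the function name `repeated_ngrams`, and the sample trace are illustrative assumptions only.

```python
from collections import Counter

def repeated_ngrams(steps, n=2, min_repeats=2):
    """Return every run of n consecutive steps that occurs at least
    min_repeats times in the trace (a rough loop signal)."""
    normalized = [s.strip().lower() for s in steps]
    grams = Counter(
        tuple(normalized[i:i + n]) for i in range(len(normalized) - n + 1)
    )
    return {gram: count for gram, count in grams.items() if count >= min_repeats}

# Hypothetical trace: the same proposal is made and rejected twice.
trace = [
    "propose: drop rope",
    "reject",
    "propose: drop rope",
    "reject",
    "propose: drop radio",
    "accept",
]

loops = repeated_ngrams(trace)
print(loops)
```

In this made-up trace the pair ("propose: drop rope", "reject") appears twice and is flagged. A trace with no repeated pairs yields an empty result, which would be consistent with the more goal-oriented behavior the study attributes to well-performing models.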

The Price of Thought

The study concludes that while scaling test-time compute through reasoning is a powerful tool for enhancing negotiation performance in LLMs, it comes with a considerable computational expense. The multilingual aspect reveals a crucial difference between commercial and open-weight models regarding language consistency in internal thought processes. These findings pave the way for developing more versatile and strategically adaptive AI agents in the future.
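The figures quoted in this article mix absolute point gains (Qwen-3's 56 points) with relative changes (GPT-5's 31.4% improvement and its nearly 400% cost increase), so it helps to keep the two apart when weighing performance against cost. The sketch below shows the arithmetic. The scores and costs are hypothetical placeholders chosen only to make the calculation easy to follow, and the "cost per point" ratio is one possible efficiency measure, not necessarily the one used in the study.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before

def cost_per_point(score_before, score_after, cost_before, cost_after):
    """Extra cost paid for each additional score point gained."""
    return (cost_after - cost_before) / (score_after - score_before)

# Hypothetical values, not taken from the paper.
base_score, reasoning_score = 50.0, 65.0   # score without / with reasoning
base_cost, reasoning_cost = 2.0, 10.0      # cost without / with reasoning

point_gain = reasoning_score - base_score
score_pct = pct_change(base_score, reasoning_score)
cost_pct = pct_change(base_cost, reasoning_cost)
extra_cost = cost_per_point(base_score, reasoning_score, base_cost, reasoning_cost)

print(f"gain: {point_gain:.1f} points ({score_pct:.1f}%)")
print(f"cost increase: {cost_pct:.1f}%")
print(f"extra cost per point gained: {extra_cost:.3f}")
```

Note that a cost rising from 2 to 10 is a 400% increase, which is a fivefold cost. Comparing models on extra cost per point gained, rather than on raw gains alone, is one way to express the trade-off the study describes.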

For more detail, see the full research paper, “The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models.”
