
Large Language Models Emerge as Adaptable Teammates for Human-AI Collaboration

TLDR: A new study explores how Large Language Models (LLMs) can act as ‘policy-agnostic human proxies’ to simulate human decision-making in heterogeneous AI teams, overcoming the limitations of costly human-in-the-loop data. Through experiments in a Stag Hunt-inspired grid-world game, researchers demonstrated that LLMs can align with expert judgments, adapt their behavior to exhibit risk-averse or risk-seeking strategies based on simple prompts, and generate multi-step action sequences that mimic human movement patterns. This approach offers a scalable and customizable method for integrating human-like flexibility into AI agents for complex collaborative tasks.

In the evolving landscape of artificial intelligence, a significant challenge lies in enabling AI systems to effectively collaborate with human teammates, especially when the human’s decision-making process is complex, unpredictable, or unknown to the AI. Traditional methods for training AI in such ‘heterogeneous-agent teams’ often rely on extensive and costly human-in-the-loop data, which can limit how widely these systems can be deployed.

A recent research paper, LLMs as Policy-Agnostic Teammates: A Case Study in Human Proxy Design for Heterogeneous Agent Teams, by Aju Ani Justus and Chris Baber from the School of Computer Science, University of Birmingham, proposes an innovative solution: using Large Language Models (LLMs) as ‘policy-agnostic human proxies’. This means LLMs can generate synthetic data that mimics human decision-making without needing to understand the underlying rules or ‘policies’ governing human actions. This approach offers a scalable way to simulate human behavior, potentially overcoming the limitations of traditional data collection.

The Challenge of Human-AI Collaboration

Multi-Agent Reinforcement Learning (MARL) has achieved remarkable success in cooperative multi-agent systems, even surpassing human performance in complex games. However, these methods often fall short when humans are involved in heterogeneous teams. AI agents trained through self-play can exhibit rigid behaviors, forcing humans to adapt to the AI rather than the other way around. This highlights a critical gap: MARL agents struggle to adapt to teammates whose preferences, strategies, or cognitive constraints are unknown or unobservable, as is typically the case with human teammates.

Existing solutions, such as Reinforcement Learning from Human Feedback (RLHF) or Human-in-the-Loop Reinforcement Learning (HITL RL), require significant human input, making them expensive and labor-intensive. LLMs, with their ability to synthesize human-like decisions across various domains, present a promising alternative for generating training data.

A Stag Hunt Game for Evaluation

To evaluate the effectiveness of LLMs as human proxies, the researchers conducted three experiments within a grid-world capture game inspired by the ‘Stag Hunt’ paradigm. This game theory concept involves a trade-off between a high-value target (stag) that requires cooperation to capture, and lower-value individual targets (hares) that can be captured alone. The game is played on a 5×5 grid with two hunters (blue and purple agents), one stag, and two hares. Agents observe their environment and choose to target either the stag or a hare.

Crucially, the environment’s state was described to the LLMs not through visual snapshots or coordinates, but by summarizing the relative distances between objects (e.g., distance between the blue hunter and the nearest hare). This simplified representation was chosen to focus the LLM on strategic decision-making rather than complex spatial interpretation.
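
To make this concrete, here is a minimal sketch of how such a textual summary might be generated from raw grid coordinates. The Manhattan distance metric, the dictionary keys, and the phrasing are illustrative assumptions, not the paper’s exact format:

```python
def manhattan(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def describe_state(positions: dict[str, tuple[int, int]]) -> str:
    """Summarize the grid as relative distances rather than raw coordinates."""
    blue, stag = positions["blue_hunter"], positions["stag"]
    hares = [positions["hare_1"], positions["hare_2"]]
    nearest_hare = min(manhattan(blue, h) for h in hares)
    return "\n".join([
        f"Distance from you (blue hunter) to the stag: {manhattan(blue, stag)}",
        f"Distance from you to the nearest hare: {nearest_hare}",
        f"Distance from the purple hunter to the stag: "
        f"{manhattan(positions['purple_hunter'], stag)}",
    ])


# Example 5x5 configuration (positions are made up for illustration).
state = {
    "blue_hunter": (0, 0), "purple_hunter": (4, 4),
    "stag": (2, 2), "hare_1": (0, 3), "hare_2": (4, 1),
}
print(describe_state(state))
```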

Experiment 1: Aligning with Expert Decisions

The first experiment aimed to see if LLMs could replicate decisions made by expert judges when given full visibility of the environment. Using 15 grid configurations, the decisions of three models (Llama 3.1 8B, Mixtral 8x22B, and Llama 3.1 70B) were compared against those of 30 human participants and 2 expert judges. The LLMs were prompted with game state observations and reward structures, and asked to choose ‘Stag’ or ‘Hare’.
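
The paper does not reproduce its full prompt text here, but the setup might look something like the following sketch, where query_llm stands in for whichever model API is used, and the wording and reward framing are assumptions:

```python
def build_prompt(state_summary: str) -> str:
    """Assemble a decision prompt from rules, rewards, and observations."""
    return (
        "You are the blue hunter in a 5x5 grid-world capture game.\n"
        "Capturing the stag requires both hunters and yields a high reward; "
        "a hare can be captured alone for a lower reward.\n"
        f"Current observations:\n{state_summary}\n"
        "Which target do you choose? Answer with exactly one word: "
        "Stag or Hare."
    )


def parse_choice(response: str) -> str:
    """Map a free-text model reply onto the two valid labels."""
    text = response.strip().lower()
    if "stag" in text and "hare" not in text:
        return "Stag"
    if "hare" in text and "stag" not in text:
        return "Hare"
    raise ValueError(f"Ambiguous response: {response!r}")


# choice = parse_choice(query_llm(build_prompt(state_summary)))
```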

The results showed that the larger models, Llama 3.1 70B and Mixtral 8x22B, closely aligned with expert judgments, achieving F1-Scores around 0.80 and Cohen’s Kappa scores around 0.60. This far exceeded the agreement achieved by human participants, whose Kappa score was only 0.07. This demonstrated that LLMs, especially larger ones, could consistently apply underlying decision criteria and serve as reliable stand-ins for human experts in cooperative tasks.
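
For readers unfamiliar with these metrics, agreement statistics of this kind are typically computed as follows; the label lists below are made-up placeholders, not the study’s data:

```python
from sklearn.metrics import cohen_kappa_score, f1_score

expert_labels = ["Stag", "Hare", "Stag", "Stag", "Hare"]  # expert judgments
llm_labels    = ["Stag", "Hare", "Stag", "Hare", "Hare"]  # model choices

kappa = cohen_kappa_score(expert_labels, llm_labels)
f1 = f1_score(expert_labels, llm_labels, pos_label="Stag")
print(f"Cohen's Kappa: {kappa:.2f}, F1 (Stag as positive): {f1:.2f}")
```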

Experiment 2: Inducing Human-like Variability and Risk Sensitivity

The second experiment explored whether LLMs could generate decisions reflecting human-like response variability and risk sensitivity. Prompts were modified to include a ‘risk behavior modifier’, instructing the LLM to be either ‘risk averse’ or ‘risk seeking’. For example, a risk-averse prompt might lead the LLM to prioritize capturing a hare (lower risk, lower reward) over a stag (higher risk, higher reward).
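
A minimal sketch of how such a modifier might be spliced into the prompt is shown below; the exact modifier sentences are assumptions, not the paper’s wording:

```python
RISK_MODIFIERS = {
    "risk_averse": (
        "You are risk averse: you strongly prefer a guaranteed smaller "
        "reward over a larger reward that depends on your teammate."
    ),
    "risk_seeking": (
        "You are risk seeking: you prefer to gamble on the high-reward "
        "stag even if coordination with your teammate might fail."
    ),
}


def build_risky_prompt(base_prompt: str, profile: str) -> str:
    """Prepend a risk-profile instruction to an existing game prompt."""
    return f"{RISK_MODIFIERS[profile]}\n\n{base_prompt}"
```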

The findings indicated that both Llama 3.1 70B and Mixtral 8x22B could simulate these different risk profiles with minor prompt adjustments. Llama 3.1 70B tended towards neutral or cooperative behavior by default, while Mixtral 8x22B showed a more risk-averse baseline but adapted effectively when guided by prompts. This flexibility suggests that LLMs can be steered to reflect diverse risk profiles relevant to team coordination, making them more versatile human analogues.

Experiment 3: Simulating Dynamic Decision-Making

The final experiment tested LLM agents in a dynamic grid-world where they had to generate movement actions over multiple steps. The LLM (acting as the Blue Hunter) was queried at each state to decide its next action (e.g., UP, DOWN, LEFT, RIGHT, STAY), while the Purple Hunter followed a predefined script. The goal was to see if LLM agents could produce coherent multi-step action sequences resembling human paths.
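
The interaction loop might be sketched as follows, assuming a hypothetical query_llm function and a simple clamped-movement rule; the capture and termination logic of the actual environment is omitted:

```python
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1),
           "RIGHT": (0, 1), "STAY": (0, 0)}


def step(pos: tuple[int, int], action: str, size: int = 5) -> tuple[int, int]:
    """Apply a move, clamping the result to the grid boundaries."""
    dr, dc = ACTIONS[action]
    return (min(max(pos[0] + dr, 0), size - 1),
            min(max(pos[1] + dc, 0), size - 1))


def run_episode(query_llm, scripted_moves: list[str],
                blue=(0, 0), purple=(4, 4)) -> list[tuple[int, int]]:
    """Query the LLM for the blue hunter's move at every step."""
    trajectory = [blue]
    for purple_action in scripted_moves:
        prompt = (f"You are the blue hunter at {blue}; the purple hunter "
                  f"is at {purple}. Choose one action: UP, DOWN, LEFT, "
                  f"RIGHT, or STAY.")
        action = query_llm(prompt).strip().upper()
        blue = step(blue, action if action in ACTIONS else "STAY")
        purple = step(purple, purple_action)
        trajectory.append(blue)
    return trajectory
```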

The results showed that the trajectories generated by both LLMs (Llama 3.1 70B and Mixtral 8x22B) exhibited human-like decision-making. While not always identical to specific human paths, the LLM-generated trajectories demonstrated clear intent and goal-oriented behavior, often mirroring human strategies, especially under cooperative prompts. This suggests LLMs can function as effective policy-agnostic agents-in-the-loop, generating data suitable for training imitation models.

Conclusion: LLMs as Scalable Human Proxies

The research concludes that LLMs can serve as effective human proxies and policy-agnostic teammates in heterogeneous multi-agent reinforcement learning. They can align with expert decisions, adapt their behavior based on prompt modifications to reflect different risk sensitivities, and generate plausible multi-step decision trajectories that resemble human actions. Unlike prior work that trains specialized models, this approach is ‘policy-agnostic’, meaning LLMs act directly from textual prompts rather than pre-trained policies, making the design process low-effort and amenable to automation.

While the study focused on a single grid-world task, the findings lay a strong foundation for future work, including extending the approach to more complex environments, integrating LLM-agents into RL training pipelines, and comparing their behavior directly with human participants in real-time heterogeneous human-AI teams. This line of research aims to foster seamless human-AI collaboration by endowing agents with the flexibility and nuance of human decision-making.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
