TLDR: A new research paper introduces methods to benchmark and enhance AI agents’ information-seeking abilities, drawing on human behavior. Using a game called ‘Collaborative Battleship’, researchers found that while LMs initially struggled, novel Monte Carlo inference strategies based on Bayesian Experimental Design significantly improved their performance. These enhancements enabled weaker LMs to outperform humans and frontier models at a fraction of the cost, demonstrating the potential for building more rational and efficient AI agents in various information-seeking tasks.
Artificial intelligence is rapidly evolving, with language models (LMs) transitioning from simple chat assistants to sophisticated agents capable of interacting with the world. These advanced agents are increasingly vital in high-stakes fields such as scientific research, mathematical theorem proving, and drug discovery, where they must generate data-driven hypotheses and make precise decisions with limited resources.
A recent research paper delves into the critical question of how rationally these LM-based agents behave when seeking information. The study introduces novel methods to benchmark and improve agentic information-seeking, drawing valuable lessons from human cognitive behavior.
The Collaborative Battleship Challenge
To evaluate agent rationality, the researchers developed a strategic, decision-oriented dialogue task called ‘Collaborative Battleship’. In this game, a ‘Captain’ agent, with only partial information, must balance exploration (asking questions) against action (taking shots). A ‘Spotter’ agent, which has full knowledge of the game board, answers the Captain’s questions accurately but only with ‘Yes’ or ‘No’. This information bottleneck prevents any single answer from revealing too much about the board.
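The two roles can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not from the paper: the board is just a set of ship cells, and a question is modeled as a hypothetical yes/no predicate over the board rather than a natural-language utterance. The names `Board`, `spotter_answer`, and `captain_shot` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

Cell = Tuple[int, int]  # (row, column), zero-indexed


@dataclass(frozen=True)
class Board:
    """Hypothetical board model: a square grid plus the cells ships occupy."""
    size: int
    ship_cells: FrozenSet[Cell]


# Assumption: a question is a predicate that is either true or false of a board.
Question = Callable[[Board], bool]


def spotter_answer(board: Board, question: Question) -> str:
    """The Spotter sees the full board but may only reply 'Yes' or 'No'."""
    return "Yes" if question(board) else "No"


def captain_shot(board: Board, cell: Cell) -> bool:
    """The Captain's action: fire at a cell and learn whether it was a hit."""
    return cell in board.ship_cells


board = Board(size=4, ship_cells=frozenset({(0, 0), (0, 1), (2, 3)}))


def any_ship_in_top_row(b: Board) -> bool:
    return any(row == 0 for row, _ in b.ship_cells)


print(spotter_answer(board, any_ship_in_top_row))
print(captain_shot(board, (1, 1)))
```

Each answer carries at most one bit, so in this toy model the Captain's skill lies in choosing which predicate to ask before committing to a shot.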
Initial comparisons with human players revealed three weaknesses in language model agents: they struggled to ground their answers in the given context, to generate truly informative questions, and to select high-value actions. Human players, by contrast, demonstrated a more nuanced approach to information gathering and strategic play.
Enhancing Agent Intelligence with Bayesian Strategies
To address these shortcomings, the researchers developed innovative Monte Carlo inference strategies for LMs, inspired by principles from Bayesian Experimental Design (BED). This approach significantly enhanced the capabilities of both Spotter and Captain agents.
For Spotter agents, the new strategies boosted answer accuracy by up to 14.7% over traditional LM-only baselines. For Captain agents, the expected information gain (EIG) from their questions increased by up to 0.227 bits, reaching 94.2% of the theoretical maximum. This led to sharper targeting in the game, with F1 scores improving by 0.303–0.374.
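To make the EIG figure concrete, here is a rough sketch of how a Monte Carlo estimate of expected information gain for a yes/no question might look. It is not the paper's implementation. It assumes a noiseless Spotter, a belief represented by sampled board hypotheses, and a toy prior with a single ship; the function names are hypothetical. Under the noiseless assumption, the EIG of a question reduces to the binary entropy of the probability that the answer is ‘Yes’, so no binary question can exceed 1 bit.

```python
import math
import random
from typing import Callable, FrozenSet, List, Tuple

Cell = Tuple[int, int]
Hypothesis = FrozenSet[Cell]  # one sampled guess at where the ship is


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no outcome with P(yes) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def sample_hypotheses(n: int, size: int, ship_len: int,
                      rng: random.Random) -> List[Hypothesis]:
    """Toy prior (an assumption): one ship, placed uniformly at random,
    horizontally or vertically, on a size x size grid."""
    samples = []
    for _ in range(n):
        horizontal = rng.random() < 0.5
        fixed = rng.randrange(size)                 # the row (or column) it lies in
        start = rng.randrange(size - ship_len + 1)  # where the ship begins
        cells = frozenset(
            (fixed, start + k) if horizontal else (start + k, fixed)
            for k in range(ship_len)
        )
        samples.append(cells)
    return samples


def estimate_eig(question: Callable[[Hypothesis], bool],
                 hypotheses: List[Hypothesis]) -> float:
    """Monte Carlo EIG estimate for a noiselessly answered yes/no question."""
    if not hypotheses:
        raise ValueError("need at least one sampled hypothesis")
    p_yes = sum(question(h) for h in hypotheses) / len(hypotheses)
    return binary_entropy(p_yes)


rng = random.Random(0)
hyps = sample_hypotheses(2000, size=8, ship_len=3, rng=rng)


def in_top_half(h: Hypothesis) -> bool:
    return any(row < 4 for row, _ in h)


def covers_corner(h: Hypothesis) -> bool:
    return (0, 0) in h


for name, q in [("in_top_half", in_top_half), ("covers_corner", covers_corner)]:
    print(f"{name}: {estimate_eig(q, hyps):.3f} bits")
```

A question that splits the sampled hypotheses roughly in half scores close to 1 bit, while a question that is almost always answered ‘No’ scores close to zero. A Captain following this sketch would generate several candidate questions and ask the one with the highest estimate.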
Remarkably, these enhancements allowed less powerful LMs, such as Llama-4-Scout, to achieve superior performance. Llama-4-Scout’s win rate against humans jumped from 8% to 82%, and against frontier models like GPT-5, it improved from 0% to 67%. Crucially, this was achieved at approximately 1% of GPT-5’s operational cost.
Broader Applications and Future Implications
The researchers further demonstrated the generality of these methods by replicating the findings on the ‘Guess Who?’ task, where accuracy rose by 28.3–42.4 percentage points. This indicates that the framework can be applied to other information-seeking environments with complex hypothesis spaces.
This work offers both practical and theoretical contributions. Practically, it introduces a reusable evaluation framework for studying agentic information-seeking and a rich, multimodal dataset called BATTLESHIPQA. Conceptually, it formalizes several Bayesian-inspired inference-time strategies that can be adapted to other discovery settings, paving the way for more rational and human-like AI agents.


