TLDR: The GuessingGame research introduces a novel protocol to evaluate Large Language Models (LLMs) as strategic question-askers in open-ended, open-domain settings. It uses a Guesser LLM to identify a hidden object by asking free-form questions to an Oracle LLM. The paper proposes two information gain (IG) metrics—a Bayesian belief-tracking method and an entropy-based method using ConceptNet—to measure question quality. Findings show that open-ended questions and specific question types (like ‘Attribute’) are more informative, and higher IG strongly predicts game efficiency. Crucially, simple prompting strategies can significantly improve LLM question-asking performance, demonstrating that this capability is both measurable and improvable. The Bayesian IG metric also correlates strongly with human performance, suggesting its general utility in assessing question informativeness.
Large Language Models (LLMs) have shown impressive capabilities in understanding and generating human-like text, excelling in tasks like factual recall and multi-turn dialogue. However, their ability to strategically ask questions, especially in open-ended and open-domain scenarios, has been less explored. This is a crucial area for interactive applications such as education, medical diagnosis, and autonomous decision-making, where knowing what to ask can be as important as knowing how to answer.
A new research paper introduces an evaluation protocol called GuessingGame to address this gap. The protocol measures how effectively LLMs can act as strategic question-askers when identifying a hidden object without predefined choices or candidate lists. The full paper is titled "GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models."
How GuessingGame Works
The GuessingGame involves three distinct LLM agents: an Oracle, a Guesser, and a Checker. The Oracle holds a secret physical object. The Guesser’s task is to identify this hidden object by asking free-form questions to the Oracle. The Checker ensures that the Guesser’s questions adhere to any experimental constraints and classifies the type of question being asked. A game proceeds in alternating turns, ending when the Guesser correctly identifies the object or after a maximum of 50 turns.
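The turn structure described above can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: the function name `play_game`, the callback signatures, the toy agents, and the rule that a rejected question still uses up a turn are all assumptions made for the example. In the real protocol each role is played by an LLM.

```python
from typing import Callable, List, Optional, Tuple

MAX_TURNS = 50  # turn cap stated in the article

History = List[Tuple[str, str]]  # (question, answer) pairs


def play_game(
    secret: str,
    guesser: Callable[[History], str],
    oracle: Callable[[str, str], str],
    checker: Callable[[str, List[str]], Tuple[bool, str]],
    max_turns: int = MAX_TURNS,
) -> Optional[int]:
    """Return the turn on which the object was identified, or None on failure."""
    history: History = []
    asked_types: List[str] = []  # question types accepted so far
    for turn in range(1, max_turns + 1):
        question = guesser(history)
        # The Checker both enforces constraints and labels the question type.
        allowed, qtype = checker(question, asked_types)
        if not allowed:
            continue  # assumption: a rejected question still consumes a turn
        asked_types.append(qtype)
        answer = oracle(secret, question)
        history.append((question, answer))
        # Assumption: a confirmed direct guess ends the game.
        if qtype == "Direct" and answer == "yes":
            return turn
    return None


# Hypothetical stand-ins for the three LLM agents, for demonstration only.
def toy_guesser(history: History) -> str:
    candidates = ["chair", "table", "lamp"]
    return f"Is it a {candidates[len(history) % len(candidates)]}?"


def toy_oracle(secret: str, question: str) -> str:
    return "yes" if question == f"Is it a {secret}?" else "no"


def toy_checker(question: str, asked_types: List[str]) -> Tuple[bool, str]:
    return True, "Direct"  # accepts everything and labels it a direct guess
```

Experimental constraints, such as forbidding a repeated question type, would plug into the `checker` callback, which receives the list of previously accepted types.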
To evaluate performance, the researchers use two primary metrics: Success Rate (SR), which measures the proportion of games where the Guesser correctly identifies the object, and Average Number of Questions (ANQ), indicating the efficiency of successful games.
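As a rough illustration of how these two metrics could be computed from game logs, the sketch below assumes each game is recorded as the number of questions used, or `None` when the Guesser failed. The function names and the log format are hypothetical, and averaging ANQ over successful games only is an assumption based on the description above.

```python
from typing import List, Optional


def success_rate(results: List[Optional[int]]) -> float:
    """Fraction of games in which the object was identified."""
    return sum(r is not None for r in results) / len(results)


def average_number_of_questions(results: List[Optional[int]]) -> float:
    """Mean number of questions, taken over successful games only."""
    wins = [r for r in results if r is not None]
    return sum(wins) / len(wins) if wins else float("nan")


# Made-up log of five games: three successes and two failures.
example_results = [12, None, 8, 20, None]
print(success_rate(example_results))                 # 0.6
print(average_number_of_questions(example_results))  # 40 / 3
```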
Measuring Question Quality with Information Gain
Beyond just success and efficiency, the paper introduces two innovative Information Gain (IG) metrics to quantify how much each question-answer pair reduces uncertainty:
- Bayesian Belief Update: This method tracks belief updates over semantic concepts. An Interpreter LLM scores the relevance of concepts implied by the Oracle’s answer, and these scores are used to update a probability distribution over potential concepts. This approach is flexible and doesn’t require a fixed knowledge base.
- Entropy-Based Information Gain: This method uses ConceptNet, a large commonsense knowledge graph, to filter candidate objects based on assertions implied by the Oracle’s answer. Information gain is then measured as the reduction in entropy (uncertainty) over the set of possible objects.
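Both ideas can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the paper's exact formulation: it treats the Interpreter's relevance scores as likelihood weights in a Bayes-style update, measures Bayesian IG as the drop in Shannon entropy, and assumes a uniform distribution over the remaining candidates for the entropy-based variant. All function names, the `floor` parameter, and the example concepts are hypothetical.

```python
import math
from typing import Dict, Set


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def bayesian_update(
    prior: Dict[str, float], relevance: Dict[str, float], floor: float = 1e-6
) -> Dict[str, float]:
    """Reweight the prior by relevance scores and renormalise.

    Assumption: concepts the Interpreter did not score get the small `floor`.
    """
    unnorm = {c: p * max(relevance.get(c, floor), floor) for c, p in prior.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


def bayesian_ig(prior: Dict[str, float], relevance: Dict[str, float]) -> float:
    """Entropy reduction from prior to posterior (an assumed definition of IG)."""
    return entropy(prior) - entropy(bayesian_update(prior, relevance))


def entropy_ig(before: Set[str], after: Set[str]) -> float:
    """Entropy reduction when filtering candidates, assuming a uniform belief."""
    if not after:
        raise ValueError("filtering removed every candidate")
    return math.log2(len(before)) - math.log2(len(after))


# Made-up example: an answer that strongly implies the object is wooden.
prior = {"wood": 0.25, "metal": 0.25, "plastic": 0.25, "glass": 0.25}
scores = {"wood": 0.9, "metal": 0.1, "plastic": 0.1, "glass": 0.1}
print(bayesian_ig(prior, scores))

# Made-up example: filtering eight candidate objects down to two.
print(entropy_ig(set("abcdefgh"), {"a", "b"}))  # 2.0 bits
```

In this sketch an answer that leaves the belief unchanged yields zero gain, while an answer that concentrates probability on fewer concepts or objects yields a positive gain.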
The researchers also categorized questions into five types: Attribute (e.g., about size, material), Function (e.g., purpose), Location (e.g., where it’s found), Category (e.g., Is it a type of X?), and Direct guesses (e.g., Is it a table?).
Key Findings and Insights
The study, conducted across 858 games, revealed several important findings:
- Open-Ended Questions are More Informative: Open-ended questions consistently led to higher success rates (39.4% vs. 32.1% for binary questions with LLaMA-3.3 70B) and higher average information gain. However, LLMs rarely used them by default, suggesting a missed opportunity.
- Attribute Questions Lead the Way: Attribute-based questions (e.g., about material or shape) were found to be the most informative, achieving the highest average IG and better task performance when used in isolation compared to Function or Location questions.
- Information Gain Predicts Efficiency: A higher information gain strongly predicted faster convergence to the correct object. A one-standard-deviation increase in Bayesian IG, for instance, correlated with a 43% reduction in expected game length.
- Prompting Strategies Improve Performance: Simple prompting constraints, such as preventing repeated question types or forcing open-ended questions, dramatically improved LLM performance. LLaMA-3.3 70B’s success rate jumped from 39.4% to 80.0% with a repeat-type constraint and to 97.4% when forced to ask open-ended questions. This highlights that even weaker models can match or outperform stronger ones if guided to ask better questions.
- Human-Aligned Metric: When applied to human-generated dialogues, the Bayesian IG metric showed strong correlations with game efficiency (Spearman ρ=−0.95 for experts), suggesting it captures a general notion of question informativeness that aligns with human intuition.
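For readers unfamiliar with the statistic, the sketch below shows how a Spearman rank correlation between per-game average IG and game length could be computed. It is a plain-Python illustration, not the authors' analysis code, and the numbers are invented solely to show that higher IG paired with shorter games produces a negative ρ.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank for the tied group
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Invented data: games with higher mean IG finish in fewer questions.
mean_ig = [0.2, 0.5, 0.9, 1.4]
game_length = [30, 22, 15, 9]
print(spearman(mean_ig, game_length))  # -1.0 for this perfectly monotone toy data
```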
Limitations and Future Directions
The paper acknowledges several limitations, including the external nature of the Bayesian belief modeling (not reflecting internal LLM states), dependence on the Interpreter LLM’s accuracy, and the coverage limitations of ConceptNet for the entropy-based metric. Future work could explore aligning external belief models with internal representations, validating alternative interpreters, and expanding the domain scope beyond everyday objects to more abstract or high-stakes scenarios.
In conclusion, GuessingGame provides a robust framework for evaluating and improving LLMs’ strategic question-asking abilities, paving the way for more interactive and intelligent AI systems.


