GuessingGame: A New Protocol for Evaluating How LLMs Ask Questions

TLDR: GuessingGame is a protocol for evaluating Large Language Models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM tries to identify a hidden object by asking free-form questions to an Oracle LLM. The paper proposes two information gain (IG) metrics to measure question quality: a Bayesian belief-tracking method and an entropy-based method that uses ConceptNet. Open-ended questions and certain question types (such as ‘Attribute’) turn out to be more informative, and higher IG strongly predicts game efficiency. Simple prompting strategies significantly improve LLM question-asking performance, which shows that this capability is both measurable and improvable. The Bayesian IG metric also correlates strongly with game efficiency in human-played dialogues, suggesting it is useful as a general measure of question informativeness.

Large Language Models (LLMs) have shown impressive capabilities in understanding and generating human-like text, excelling in tasks like factual recall and multi-turn dialogue. However, their ability to strategically ask questions, especially in open-ended and open-domain scenarios, has been less explored. This is a crucial area for interactive applications such as education, medical diagnosis, and autonomous decision-making, where knowing what to ask can be as important as knowing how to answer.

A recent research paper introduces an evaluation protocol called GuessingGame to address this gap. The protocol measures how effectively LLMs can act as strategic question-askers when identifying a hidden object without predefined choices or candidate lists. The full paper is titled “GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models.”

How GuessingGame Works

The GuessingGame involves three distinct LLM agents: an Oracle, a Guesser, and a Checker. The Oracle holds a secret physical object. The Guesser’s task is to identify this hidden object by asking free-form questions to the Oracle. The Checker ensures that the Guesser’s questions adhere to any experimental constraints and classifies the type of question being asked. A game proceeds in alternating turns, ending when the Guesser correctly identifies the object or after a maximum of 50 turns.
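The turn structure described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the three agents are plain Python callables standing in for LLM calls, and the names `play_game`, `toy_guesser`, `toy_oracle`, and `toy_checker` are hypothetical. How a rejected question is handled and how the Oracle signals a correct guess are assumptions, labeled in the comments.

```python
from typing import Callable, List, Tuple

MAX_TURNS = 50  # turn cap stated in the article

History = List[Tuple[str, str]]  # (question, answer) pairs so far


def play_game(
    secret: str,
    guesser: Callable[[History], str],
    oracle: Callable[[str, str], str],
    checker: Callable[[str], Tuple[bool, str]],
    max_turns: int = MAX_TURNS,
) -> Tuple[bool, int]:
    """Run one game and return (solved, turns_used)."""
    history: History = []
    for turn in range(1, max_turns + 1):
        question = guesser(history)
        # The Checker validates the question and labels its type.
        allowed, question_type = checker(question)
        if not allowed:
            # Assumption: a rejected question still consumes a turn.
            history.append((question, "REJECTED"))
            continue
        answer = oracle(secret, question)
        history.append((question, answer))
        # Assumption: the Oracle replies "correct" to a right direct guess.
        if question_type == "direct" and answer == "correct":
            return True, turn
    return False, max_turns


# Toy stand-ins for the three LLM agents (hypothetical, for illustration only).
def toy_guesser(history: History) -> str:
    candidates = ["chair", "table", "lamp"]
    return f"Is it a {candidates[len(history) % len(candidates)]}?"


def toy_oracle(secret: str, question: str) -> str:
    return "correct" if question == f"Is it a {secret}?" else "no"


def toy_checker(question: str) -> Tuple[bool, str]:
    return True, "direct"


if __name__ == "__main__":
    print(play_game("lamp", toy_guesser, toy_oracle, toy_checker))
```

In a real setup each callable would wrap a prompt to a model, and the Checker would also enforce constraints such as "open-ended questions only."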

To evaluate performance, the researchers use two primary metrics: Success Rate (SR), the proportion of games in which the Guesser correctly identifies the object, and Average Number of Questions (ANQ), the mean number of questions asked in successful games, where a lower value means a more efficient Guesser.
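As a rough illustration of how these two metrics could be computed from game logs, here is a small sketch. It assumes each game is recorded as a hypothetical `(solved, questions_asked)` pair; the function names are illustrative and not taken from the paper.

```python
from typing import List, Optional, Tuple

GameResult = Tuple[bool, int]  # (solved, questions_asked); an assumed log format


def success_rate(results: List[GameResult]) -> float:
    """Fraction of games in which the Guesser found the object."""
    if not results:
        return 0.0
    return sum(1 for solved, _ in results if solved) / len(results)


def average_number_of_questions(results: List[GameResult]) -> Optional[float]:
    """Mean question count over successful games only; None if none succeeded."""
    counts = [asked for solved, asked in results if solved]
    if not counts:
        return None
    return sum(counts) / len(counts)


if __name__ == "__main__":
    # Toy log, invented for illustration (not data from the paper).
    toy_results = [(True, 4), (False, 50), (True, 8), (False, 50)]
    print(success_rate(toy_results))
    print(average_number_of_questions(toy_results))
```

Failed games are excluded from ANQ here because the article describes it as a measure over successful games; including them would mix efficiency with failure rate.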

Measuring Question Quality with Information Gain

Beyond just success and efficiency, the paper introduces two innovative Information Gain (IG) metrics to quantify how much each question-answer pair reduces uncertainty:

Bayesian Belief Update: This method tracks belief updates over semantic concepts. An Interpreter LLM scores the relevance of concepts implied by the Oracle’s answer, and these scores are used to update a probability distribution over potential concepts. This approach is flexible and doesn’t require a fixed knowledge base.
Entropy-Based Information Gain: This method uses ConceptNet, a large commonsense knowledge graph, to filter candidate objects based on assertions implied by the Oracle’s answer. Information gain is then measured as the reduction in entropy (uncertainty) over the set of possible objects.
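The two metrics above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's exact formulation: the Bayesian variant treats Interpreter relevance scores in [0, 1] as likelihood weights and reports the drop in entropy of the belief distribution, and the entropy-based variant assumes a uniform distribution over candidate objects, so filtering N candidates down to M yields log2(N/M) bits. All function names and the `default_weight` parameter are hypothetical.

```python
import math
from typing import Dict, Set, Tuple


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def bayesian_update(
    prior: Dict[str, float],
    relevance: Dict[str, float],
    default_weight: float = 0.5,
) -> Dict[str, float]:
    """Reweight the prior by relevance scores and renormalize.

    Assumption: scores act as likelihoods; concepts the Interpreter did not
    score receive `default_weight`.
    """
    weighted = {c: p * relevance.get(c, default_weight) for c, p in prior.items()}
    total = sum(weighted.values())
    if total == 0:
        return dict(prior)  # nothing survives; keep the prior unchanged
    return {c: w / total for c, w in weighted.items()}


def bayesian_ig(
    prior: Dict[str, float], relevance: Dict[str, float]
) -> Tuple[float, Dict[str, float]]:
    """Entropy reduction (bits) from one question-answer pair, plus the posterior."""
    posterior = bayesian_update(prior, relevance)
    return entropy(prior) - entropy(posterior), posterior


def entropy_ig(candidates: Set[str], consistent: Set[str]) -> float:
    """Entropy reduction (bits) when filtering a uniform candidate set."""
    remaining = candidates & consistent
    if not candidates or not remaining:
        return 0.0  # assumption: an empty filter result is treated as no gain
    return math.log2(len(candidates)) - math.log2(len(remaining))


if __name__ == "__main__":
    prior = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
    gain, posterior = bayesian_ig(prior, {"a": 1.0, "b": 1.0, "c": 0.0, "d": 0.0})
    print(gain, posterior)
    print(entropy_ig({"a", "b", "c", "d"}, {"a"}))
```

In the ConceptNet-based metric, the `consistent` set would come from matching the assertions implied by the Oracle's answer against the knowledge graph; here it is simply passed in.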

The researchers also categorized questions into five types: Attribute (e.g., about size, material), Function (e.g., purpose), Location (e.g., where it’s found), Category (e.g., Is it a type of X?), and Direct guesses (e.g., Is it a table?).

Key Findings and Insights

The study, conducted across 858 games, revealed several important findings:

Open-Ended Questions are More Informative: Open-ended questions consistently led to higher success rates (39.4% vs. 32.1% for binary questions with LLaMA-3.3 70B) and higher average information gain. However, LLMs rarely used them by default, suggesting a missed opportunity.
Attribute Questions Lead the Way: Attribute-based questions (e.g., about material or shape) were found to be the most informative, achieving the highest average IG and better task performance when used in isolation compared to Function or Location questions.
Information Gain Predicts Efficiency: A higher information gain strongly predicted faster convergence to the correct object. A one-standard-deviation increase in Bayesian IG, for instance, correlated with a 43% reduction in expected game length.
Prompting Strategies Improve Performance: Simple prompting constraints, such as preventing repeated question types or forcing open-ended questions, dramatically improved LLM performance. LLaMA-3.3 70B’s success rate jumped from 39.4% to 80.0% with a repeat-type constraint and to 97.4% when forced to ask open-ended questions. This highlights that even weaker models can match or outperform stronger ones if guided to ask better questions.
Human-Aligned Metric: When applied to human-generated dialogues, the Bayesian IG metric showed strong correlations with game efficiency (Spearman ρ=−0.95 for experts), suggesting it captures a general notion of question informativeness that aligns with human intuition.
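To make the Spearman ρ figure concrete, the sketch below computes a rank correlation between per-game average IG and game length. The implementation is a standard one (Pearson correlation on average ranks), written without external libraries; the numbers in the usage example are invented for illustration and are not results from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values, invented for illustration only.
    mean_ig_per_game = [0.2, 0.5, 0.9, 1.4]
    questions_needed = [30, 18, 11, 6]
    print(spearman_rho(mean_ig_per_game, questions_needed))
```

A strongly negative ρ, as reported for the expert dialogues, means that games with higher average information gain tend to finish in fewer questions.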

Limitations and Future Directions

The paper acknowledges several limitations, including the external nature of the Bayesian belief modeling (not reflecting internal LLM states), dependence on the Interpreter LLM’s accuracy, and the coverage limitations of ConceptNet for the entropy-based metric. Future work could explore aligning external belief models with internal representations, validating alternative interpreters, and expanding the domain scope beyond everyday objects to more abstract or high-stakes scenarios.

In conclusion, GuessingGame provides a robust framework for evaluating and improving LLMs’ strategic question-asking abilities, paving the way for more interactive and intelligent AI systems.
