TLDR: A new benchmark called ORACLE uses “black-box interaction” to evaluate how large language models (LLMs) reason in unknown, interactive environments. LLMs must uncover hidden rules by observing input-output pairs across six task types. While top LLMs perform well on easy tasks, they struggle with complex ones, primarily due to a lack of high-level planning and adaptive exploration strategies, revealing a key limitation in their advanced reasoning capabilities.
The field of artificial intelligence is constantly pushing the boundaries of what large language models (LLMs) can achieve, especially in complex reasoning tasks. However, a significant challenge remains in truly evaluating these models in environments that mimic the real world – interactive and unknown. Current methods often assess different types of reasoning, like deduction or induction, in isolation, missing the integrated reasoning process that is crucial for human discovery.
To address this, a new evaluation approach called “black-box interaction” has been introduced. Imagine a hidden function, a “black-box,” that takes specific inputs and produces outputs. Large language models are then tasked with figuring out the secret rules of this black-box by interacting with it over a series of turns, analyzing the input-output pairs they observe. This novel paradigm aims to test what the researchers call “advanced reasoning” in LLMs.
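To make the setup concrete, here is a minimal sketch of one black-box interaction episode. Everything in it is an illustrative assumption rather than the benchmark's code: the hidden rule `3 * x + 1`, the helper names, and the toy solver that assumes a linear rule. In ORACLE the solver is an LLM choosing its own queries.

```python
from typing import Callable, List, Tuple

def make_black_box() -> Callable[[int], int]:
    """Hypothetical hidden rule; the solver never sees this body."""
    def hidden(x: int) -> int:
        return 3 * x + 1
    return hidden

def interact(black_box: Callable[[int], int], queries: List[int]) -> List[Tuple[int, int]]:
    """Send one query per turn and record the observed (input, output) pairs."""
    return [(x, black_box(x)) for x in queries]

def fit_linear(observations: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Toy solver: assume the rule is a*x + b and recover a, b from two points."""
    (x0, y0), (x1, y1) = observations[0], observations[1]
    a = (y1 - y0) // (x1 - x0)
    b = y0 - a * x0
    return a, b

box = make_black_box()
history = interact(box, [0, 1, 5])   # three interaction turns
a, b = fit_linear(history)           # hypothesis formed from the observations

# Evaluate the hypothesis on an input that was never queried.
print(history, (a, b), a * 10 + b == box(10))
```

The point of the sketch is the division of roles: the black-box only answers queries, and the solver has to decide what to ask and what rule the answers imply.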
Leveraging this idea, the ORACLE benchmark was developed. This comprehensive benchmark features 96 unique black-boxes across six distinct task types. These tasks include:
- Code Intent Inference (CII): Understanding a hidden code algorithm.
- Circuit Rule Inference (CRI): Deciphering the logic of an acyclic boolean circuit.
- Physics System Inference (PSI): Inferring the laws governing a classical mechanical system.
- Encryption Rule Inference (ERI): Uncovering the method behind an encryption process.
- Interactive Puzzle Inference (IPI): Solving a puzzle with a hidden answer through interaction.
- Game Strategy Inference (GSI): Deducing an opponent’s fixed game strategy to outperform it.
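As an illustration of one task type, the following sketch shows what a very small Circuit Rule Inference instance might look like. The three-input circuit, the function names, and the exhaustive querying are assumptions made for this example. Actual benchmark circuits and turn limits may make exhaustive querying impractical.

```python
from itertools import product

def hidden_circuit(a: bool, b: bool, c: bool) -> bool:
    """Hypothetical black-box: a small acyclic circuit of AND, OR, and NOT gates."""
    return (a and b) or (not c)

def truth_table(circuit, n_inputs: int) -> dict:
    """Query every input combination and record the output for each."""
    return {bits: circuit(*bits) for bits in product([False, True], repeat=n_inputs)}

def hypothesis(a: bool, b: bool, c: bool) -> bool:
    """A solver's guess, written differently but equivalent by De Morgan's law."""
    return not (c and not (a and b))

observed = truth_table(hidden_circuit, 3)  # 2**3 = 8 queries
matches = all(hypothesis(*bits) == out for bits, out in observed.items())
print(len(observed), matches)
```

A guess counts as correct here when it reproduces the observed behavior on every input, even if its gate structure differs from the hidden circuit.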
The ORACLE benchmark is designed to be highly adaptable and scalable, thanks to a fully automated agentic framework for black-box construction. This framework uses three LLM-based modules – a Coding LLM, a Test LLM, and a Refinement LLM – to generate diverse black-boxes from natural language descriptions, simulate interactions, and iteratively debug the code. This process significantly reduces the cost and effort involved in creating and expanding such a benchmark.
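The generate, test, and refine cycle described above can be sketched roughly as follows. This is not the paper's framework: the three modules are replaced by plain Python stubs, and `build_black_box`, `stub_coder`, `stub_tester`, and `stub_refiner` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Optional

def build_black_box(
    description: str,
    coder: Callable[[str], str],
    tester: Callable[[str], Optional[str]],
    refiner: Callable[[str, str], str],
    max_rounds: int = 3,
) -> str:
    """Draft code from a description, test it, and refine it until the tests pass."""
    code = coder(description)
    for _ in range(max_rounds):
        error = tester(code)          # None means the simulated interaction succeeded
        if error is None:
            return code
        code = refiner(code, error)   # feed the failure report back for a fix
    raise RuntimeError("black-box still failing after refinement")

# Stubs standing in for the Coding, Test, and Refinement LLMs (assumptions).
def stub_coder(description: str) -> str:
    # Deliberately buggy first draft so the loop has something to repair.
    return "def black_box(x):\n    return x + x + 1"

def stub_tester(code: str) -> Optional[str]:
    namespace: dict = {}
    try:
        exec(code, namespace)
        if namespace["black_box"](4) != 8:
            return "black_box(4) should be 8"
    except Exception as exc:
        return repr(exc)
    return None

def stub_refiner(code: str, error: str) -> str:
    return "def black_box(x):\n    return x + x"

final_code = build_black_box("double the input", stub_coder, stub_tester, stub_refiner)
print(final_code)
```

Keeping the three roles as separate callables mirrors the modular description in the article, and it means each stub could be swapped for a real model call without touching the loop.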
The researchers evaluated 19 leading large language models, both proprietary and open-weight, on the ORACLE benchmark. While top-performing models like o3, o4-mini, and gemini-2.5-pro showed strong results, especially on easier tasks (achieving over 70% accuracy), they still struggled significantly with harder black-boxes, where their average performance dropped below 40%.
A critical and universal weakness emerged from the analysis: LLMs generally lack the high-level planning capability needed to develop efficient and adaptive exploration strategies. This deficiency hinders their ability to effectively refine hypotheses and understand complex black-box mechanisms, especially under limited interaction turns. Even when given more exploration turns and evaluation attempts, the performance gains were often negligible in tasks like Encryption Rule Inference and Game Strategy Inference, suggesting a fundamental limitation in their adaptive reasoning. The study categorizes exploration strategies into three tiers, noting that most LLMs operate at Tier 1 (random exploration), with the best performing at Tier 2 (efficient but not adaptively optimized), while Tier 3 (adaptive optimization) remains largely a human domain.
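The gap between random and adaptive exploration can be illustrated with a toy problem that is not part of the benchmark: finding a hidden integer threshold using a fixed number of yes/no queries. The task, the function names, and the loose mapping to the tiers are assumptions for illustration only.

```python
import random

def make_box(threshold: int):
    """Hypothetical black-box: reports whether the query reaches a hidden threshold."""
    return lambda x: x >= threshold

def random_exploration(box, lo: int, hi: int, turns: int, rng: random.Random):
    """Non-adaptive: each query is drawn at random, ignoring earlier answers."""
    low, high = lo, hi
    for _ in range(turns):
        x = rng.randint(lo, hi)
        if box(x):
            high = min(high, x)      # threshold is at most x
        else:
            low = max(low, x + 1)    # threshold is above x
    return low, high                 # remaining candidate range

def adaptive_exploration(box, lo: int, hi: int, turns: int):
    """Adaptive: each query halves the range left open by earlier answers."""
    low, high = lo, hi
    for _ in range(turns):
        if low == high:
            break
        mid = (low + high) // 2
        if box(mid):
            high = mid
        else:
            low = mid + 1
    return low, high

box = make_box(700)
print(random_exploration(box, 0, 1023, 10, random.Random(0)))
print(adaptive_exploration(box, 0, 1023, 10))  # -> (700, 700)
```

With the same budget of 10 turns, the adaptive strategy always narrows 1,024 candidates down to one, while the random strategy only guarantees that the true threshold lies somewhere in the range it returns.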
This research introduces a valuable new tool for evaluating the integrated, human-like reasoning abilities of large language models. It highlights the current limitations of LLMs in strategic planning and adaptive exploration, pointing towards crucial areas for future development in artificial general intelligence. For more details, see the full paper.


