Benchmarking AI’s Problem-Solving: Introducing PUZZLEPLEX for Reasoning and Planning

TLDR: PUZZLEPLEX is a new benchmark with 15 novel puzzles designed to evaluate foundation models’ reasoning and planning. It tests models in instruction-based (natural language) and code-based (generating executable code) settings, across single/two-player, deterministic/stochastic, and text/text-image formats. Findings show reasoning models excel in instruction-based tasks, open-source models are competitive, and code-based tasks are more challenging but efficient. The benchmark highlights current AI limitations in multi-hop reasoning and program synthesis, guiding future AI development.

Foundation models, the powerful AI systems behind many recent breakthroughs, have shown incredible progress in understanding and generating human language. However, a deeper question remains: how well can these models truly reason and plan, especially in complex situations that demand sustained, structured thinking?

To answer this, researchers from New York University, Zhejiang University, Yale University, University at Buffalo, SUNY, and NYU Grossman School of Medicine have introduced a new benchmark called PUZZLEPLEX. This innovative platform is designed to rigorously test the reasoning and planning abilities of these advanced AI models using a diverse collection of puzzles. The full research paper can be found here.

Introducing PUZZLEPLEX

Unlike previous benchmarks that often reuse well-known puzzles, PUZZLEPLEX features 15 entirely new, carefully selected puzzles. This reduces the risk that models have simply memorized solutions from their training data. The puzzles cover a wide range of scenarios, including:

  • Single-player and two-player games.
  • Deterministic environments (where outcomes are predictable) and stochastic environments (where chance plays a role).
  • Puzzles presented as text-only or a combination of text and images.

The framework is also designed to be adaptable, allowing for the creation of even more challenging puzzle instances as AI models continue to evolve. To provide a baseline for comparison, the researchers also implemented specialized game-playing strategies for each puzzle.

How Models Were Evaluated

The study assessed foundation models under two distinct evaluation protocols:

Instruction-based Evaluation: In this setting, models interact with the puzzles using natural language, much like a human player would. They receive instructions and provide their moves or decisions in text format.

Code-based Evaluation: Here, the models are tasked with generating executable code that can solve the puzzles autonomously. This approach tests not only their reasoning but also their ability to synthesize correct and functional programs.
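The difference between the two protocols can be sketched on a toy puzzle. Everything in this sketch is an assumption made for illustration, not part of PUZZLEPLEX: the `CountdownPuzzle` class, the `choose_move` entry point, and the harness functions are hypothetical, and a plain callable stands in for a real model. A real harness would also run generated code in a sandbox rather than calling `exec` directly.

```python
from typing import Callable, List


class CountdownPuzzle:
    """Hypothetical toy puzzle: bring a counter to exactly zero."""

    def __init__(self, start: int) -> None:
        self.value = start
        self.moves_made = 0

    def legal_moves(self) -> List[int]:
        # A move subtracts 1, 2, or 3 and may not push the counter below zero.
        return [m for m in (1, 2, 3) if m <= self.value]

    def apply(self, move: int) -> bool:
        # Illegal moves are rejected and leave the state unchanged.
        if move not in self.legal_moves():
            return False
        self.value -= move
        self.moves_made += 1
        return True

    def solved(self) -> bool:
        return self.value == 0

    def describe(self) -> str:
        return f"The counter is {self.value}. Subtract 1, 2, or 3. Reply with one number."


def run_instruction_based(puzzle: CountdownPuzzle, model: Callable[[str], str],
                          max_turns: int = 20) -> bool:
    """Instruction-based: the model is asked in natural language on every turn."""
    for _ in range(max_turns):
        if puzzle.solved():
            return True
        reply = model(puzzle.describe())
        try:
            move = int(reply.strip())
        except ValueError:
            continue  # an unparseable reply wastes the turn
        puzzle.apply(move)
    return puzzle.solved()


def run_code_based(puzzle: CountdownPuzzle, generated_source: str,
                   max_turns: int = 20) -> bool:
    """Code-based: the model writes a program once; the program then plays alone."""
    namespace: dict = {}
    try:
        exec(generated_source, namespace)  # syntax errors surface here
        policy = namespace["choose_move"]
    except Exception:
        return False
    for _ in range(max_turns):
        if puzzle.solved():
            return True
        try:
            move = policy(puzzle.value)
        except Exception:
            return False  # runtime error inside the generated program
        puzzle.apply(move)
    return puzzle.solved()


# Stand-in for code a model might generate (an assumption, not real model output).
GENERATED_SOURCE = "def choose_move(value):\n    return min(3, value)\n"
```

In the instruction-based loop, the model is called once per move. In the code-based loop, it is called once in total, which is why that setting is cheaper to run but fails outright when the program does not compile or crashes.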

Key Findings from the Benchmark

The results offer valuable insights into the current capabilities and limitations of foundation models:

Reasoning Models Excel in Instruction-based Settings: Models specifically designed for reasoning, such as DeepSeek-R1, consistently outperformed non-reasoning models when interacting through natural language instructions. This suggests that allowing models to “think” more deeply during the task (known as test-time scaling) significantly improves their performance.

Open-Source Models are Catching Up: A notable finding was the strong performance of open-source models. DeepSeek-R1, for example, achieved the highest normalized score in the instruction-based setting, even surpassing some proprietary models like Gemini-2.5-pro. This indicates rapid progress in the open-source AI community.

Code-based Evaluation Poses Greater Challenges: The code-based setting is attractive for efficiency, since a generated program can solve puzzles without further model calls, but it proved more difficult for models. Generating accurate, executable code requires a different set of skills, and performance dropped noticeably compared with instruction-based interaction. However, the study also found that generating multiple code samples and picking the best one could significantly improve performance.
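Sampling several programs and keeping the best one can be sketched as follows. This is a minimal illustration, not the study's procedure: the article does not say how the best sample is chosen, so the sketch assumes each candidate is scored by the fraction of puzzle instances it solves. The toy task, the `solve` entry point, and the names `score_program` and `best_of_n` are hypothetical.

```python
from typing import List, Tuple


def reference_answer(n: int) -> int:
    # Toy task: fewest moves to bring n to zero when each move subtracts at most 3.
    return (n + 2) // 3


def score_program(source: str, instances: List[int]) -> float:
    """Fraction of instances solved; a program that fails to load scores 0.0."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # a real harness would sandbox this step
        solve = namespace["solve"]
    except Exception:
        return 0.0
    solved = 0
    for n in instances:
        try:
            if solve(n) == reference_answer(n):
                solved += 1
        except Exception:
            pass  # a runtime error counts as a failed instance
    return solved / len(instances)


def best_of_n(candidates: List[str], instances: List[int]) -> Tuple[int, float]:
    """Return the index and score of the highest-scoring candidate."""
    scores = [score_program(src, instances) for src in candidates]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Hand-written stand-ins for sampled programs (assumptions, not real model output).
CANDIDATES = [
    "def solve(n) return n",                     # does not compile
    "def solve(n):\n    return n // 3\n",        # wrong on most inputs
    "def solve(n):\n    return (n + 2) // 3\n",  # correct
]
```

With these candidates, `best_of_n(CANDIDATES, [1, 2, 3, 4, 5, 6])` selects the third program. A single sample would have landed on a broken or incorrect program two times out of three.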

Prompting Strategies Have Mixed Results: Advanced prompting techniques, like Chain-of-Thought (CoT) or Tree-of-Thought (ToT), did not help consistently. Interestingly, for some puzzles, removing the model’s past reasoning history actually led to better results, suggesting that current models can sometimes be misled by their own previous “thoughts” in multi-step reasoning tasks. However, providing models with a list of legal moves consistently boosted performance, as it helped them avoid invalid moves.
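The two prompt variations described here, listing legal moves and dropping past reasoning, can be sketched as a small prompt builder. The function name `build_prompt` and the prompt wording are assumptions for illustration; the article does not show the prompts the researchers actually used.

```python
from typing import List, Optional, Sequence


def build_prompt(
    state_text: str,
    legal_moves: Optional[Sequence[str]] = None,
    history: Optional[Sequence[str]] = None,
) -> str:
    """Assemble a turn prompt; legal moves and reasoning history are both optional."""
    parts: List[str] = []
    if history:
        # Leaving `history` out reproduces the "no past reasoning" variant.
        parts.append("Your earlier reasoning:")
        parts.extend(f"- {step}" for step in history)
    parts.append(f"Current state: {state_text}")
    if legal_moves:
        # Listing the options lets the model pick a valid move instead of inventing one.
        parts.append("Legal moves: " + ", ".join(legal_moves))
        parts.append("Choose exactly one of the legal moves.")
    else:
        parts.append("Choose your next move.")
    return "\n".join(parts)
```

For example, `build_prompt("counter is 5", legal_moves=["1", "2", "3"])` produces a prompt with no history section and an explicit list of options.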

Multimodal Inputs Offer Benefits: For puzzles that included visual information (text-image format), most models showed improved performance when incorporating these visual inputs. This highlights the value of image-based representations in strategic puzzle-solving, though weaker models sometimes struggled to effectively utilize this information.

Scaling and Error Analysis: For reasoning models, performance correlated more strongly with the amount of “thinking” (measured by generated tokens). In instruction-based settings, these models also made fewer errors. In code-based settings, however, syntax errors and runtime errors became more prevalent, even for reasoning models.

Looking Ahead

PUZZLEPLEX provides a robust new tool for evaluating and guiding the development of foundation models. By exposing their strengths and weaknesses in reasoning, planning, and generalization across diverse and novel puzzle types, this benchmark will help researchers push the boundaries of AI capabilities, especially in areas requiring complex, multi-step problem-solving.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
