Benchmarking AI's Problem-Solving: Introducing PUZZLEPLEX for Reasoning and Planning

TLDR: PUZZLEPLEX is a new benchmark of 15 novel puzzles designed to evaluate foundation models’ reasoning and planning. It tests models in two settings, instruction-based (playing through natural language) and code-based (generating executable code), across single- and two-player, deterministic and stochastic, and text-only and text-image formats. Findings show that reasoning models excel in the instruction-based setting, that open-source models are competitive, and that the code-based setting is harder for models although promising for efficiency. The benchmark highlights current AI limitations in multi-hop reasoning and program synthesis and can guide future AI development.

Foundation models, the powerful AI systems behind many recent breakthroughs, have shown incredible progress in understanding and generating human language. However, a deeper question remains: how well can these models truly reason and plan, especially in complex situations that demand sustained, structured thinking?

To answer this, researchers from New York University, Zhejiang University, Yale University, the University at Buffalo (SUNY), and NYU Grossman School of Medicine have introduced a new benchmark called PUZZLEPLEX. The platform is designed to rigorously test the reasoning and planning abilities of these models using a diverse collection of puzzles. The full research paper can be found here.

Introducing PUZZLEPLEX

Unlike previous benchmarks that often reuse well-known puzzles, PUZZLEPLEX features 15 entirely new, carefully selected puzzles. This ensures that the models haven’t simply memorized solutions from their training data. The puzzles cover a wide range of scenarios, including:

Single-player and two-player games.
Deterministic environments (where outcomes are predictable) and stochastic environments (where chance plays a role).
Puzzles presented as text-only or a combination of text and images.

The framework is also designed to be adaptable, allowing for the creation of even more challenging puzzle instances as AI models continue to evolve. To provide a baseline for comparison, the researchers also implemented specialized game-playing strategies for each puzzle.

How Models Were Evaluated

The study assessed foundation models under two distinct evaluation protocols:

Instruction-based Evaluation: In this setting, models interact with the puzzles using natural language, much like a human player would. They receive instructions and provide their moves or decisions in text format.

Code-based Evaluation: Here, the models are tasked with generating executable code that can solve the puzzles autonomously. This approach tests not only their reasoning but also their ability to synthesize correct and functional programs.
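To make the difference between the two protocols concrete, here is a minimal sketch in Python. It is not the PUZZLEPLEX harness: the stone-pile puzzle, the function names (`run_instruction_based`, `run_code_based`, `choose_move`), and the stand-in "models" are all assumptions invented for illustration. The point is only that the instruction-based loop queries the model on every turn, while the code-based loop executes model-written code once and then plays without further model calls.

```python
from typing import Callable

# Hypothetical toy puzzle (not one of the PUZZLEPLEX puzzles): start with a
# pile of stones, remove 1-3 per turn, and solve it by emptying the pile.
def legal_moves(pile: int) -> list[int]:
    return [k for k in (1, 2, 3) if k <= pile]

def run_instruction_based(ask_model: Callable[[str], str], pile: int,
                          max_turns: int = 20) -> bool:
    """Query the model in natural language on every turn."""
    for _ in range(max_turns):
        if pile == 0:
            return True
        reply = ask_model(
            f"The pile has {pile} stones. Reply with how many to remove (1-3)."
        )
        try:
            move = int(reply.strip())
        except ValueError:
            return False  # an unparseable reply counts as a failure
        if move not in legal_moves(pile):
            return False  # an illegal move counts as a failure
        pile -= move
    return pile == 0

def run_code_based(generated_source: str, pile: int,
                   max_turns: int = 20) -> bool:
    """Execute model-written code once, then play with no further model calls."""
    namespace: dict = {}
    try:
        exec(generated_source, namespace)  # a real harness would sandbox this
        choose_move = namespace["choose_move"]
    except Exception:
        return False  # syntax error or missing entry point
    for _ in range(max_turns):
        if pile == 0:
            return True
        try:
            move = choose_move(pile)
        except Exception:
            return False  # runtime error in the generated program
        if move not in legal_moves(pile):
            return False
        pile -= move
    return pile == 0

# Stand-ins for a real model: a canned chat reply and a canned generated program.
def fake_chat_model(prompt: str) -> str:
    return "1"

fake_generated_code = "def choose_move(pile):\n    return min(3, pile)\n"

print(run_instruction_based(fake_chat_model, pile=5))
print(run_code_based(fake_generated_code, pile=5))
```

In this sketch the code-based run costs a single generation no matter how long the game lasts, which is one way to read the efficiency advantage the article attributes to that setting.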

Key Findings from the Benchmark

The results offer valuable insights into the current capabilities and limitations of foundation models:

Reasoning Models Excel in Instruction-based Settings: Models specifically designed for reasoning, such as DeepSeek-R1, consistently outperformed non-reasoning models when interacting through natural language instructions. This suggests that allowing models to “think” more deeply during the task (known as test-time scaling) significantly improves their performance.

Open-Source Models are Catching Up: A notable finding was the strong performance of open-source models. DeepSeek-R1, for example, achieved the highest normalized score in the instruction-based setting, even surpassing some proprietary models like Gemini-2.5-pro. This indicates rapid progress in the open-source AI community.

Code-based Evaluation Poses Greater Challenges: Although the code-based setting is promising for efficiency, it proved more difficult for models. Generating accurate, executable code requires a different set of skills, and performance dropped noticeably compared to instruction-based interaction. However, the study also found that generating multiple code samples and picking the best one could significantly improve performance.
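The "generate several samples and keep the best" idea can be sketched as a best-of-N selection loop. The article does not say how the study scored or selected samples, so everything below is an assumption: the toy task (summing 1 to n), the `solve` entry point, and the rule of scoring each candidate by the fraction of instances it solves.

```python
from typing import Sequence

def score_program(source: str, instances: Sequence[int]) -> float:
    """Fraction of toy instances solved; crashes and wrong answers score zero."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # a real harness would sandbox this
        solve = namespace["solve"]
    except Exception:
        return 0.0  # syntax error or missing entry point
    solved = 0
    for n in instances:
        try:
            # Toy task (an assumption): return the sum 1 + 2 + ... + n.
            if solve(n) == n * (n + 1) // 2:
                solved += 1
        except Exception:
            pass  # runtime error on this instance
    return solved / len(instances)

def best_of_n(candidates: Sequence[str], instances: Sequence[int]) -> tuple[float, str]:
    """Score every candidate program and return the (score, source) of the best."""
    scored = [(score_program(src, instances), src) for src in candidates]
    return max(scored, key=lambda pair: pair[0])

# Three hand-written stand-ins for sampled model outputs.
candidates = [
    "def solve(n) return n",                           # does not parse
    "def solve(n):\n    return sum(range(n))\n",       # off by one
    "def solve(n):\n    return sum(range(n + 1))\n",   # correct
]
best_score, best_source = best_of_n(candidates, instances=[0, 1, 4, 10])
print(best_score)
print(best_source)
```

Selecting on the same instances used for the final score would inflate results, so a real setup would presumably select on held-out instances or on a cheaper validity check.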

Prompting Strategies Have Mixed Results: Advanced prompting techniques such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) did not help consistently. Interestingly, for some puzzles, removing the model’s past reasoning history led to better results, suggesting that current models can be misled by their own previous “thoughts” in multi-step reasoning tasks. However, providing models with a list of legal moves consistently boosted performance, as it helped them avoid invalid actions.
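The two prompt-level knobs described above, whether to include past reasoning and whether to list legal moves, can be pictured as optional sections of a prompt. The sketch below is an illustration only: the function names, the prompt wording, and the state format are assumptions, not the prompts used in the study.

```python
from typing import Optional, Sequence

def build_prompt(state: str,
                 legal_moves: Optional[Sequence[str]] = None,
                 history: Optional[Sequence[str]] = None) -> str:
    """Assemble a turn prompt with optional history and legal-move sections."""
    parts = [f"Current state: {state}"]
    if history:
        # Dropping this section corresponds to removing past reasoning history.
        parts.append("Previous reasoning:\n" + "\n".join(history))
    if legal_moves is not None:
        # Listing the legal moves lets the model pick instead of guess.
        parts.append("Legal moves: " + ", ".join(legal_moves))
    parts.append("Reply with exactly one move.")
    return "\n\n".join(parts)

def is_valid_reply(reply: str, legal_moves: Sequence[str]) -> bool:
    """Check a model reply against the legal moves after trimming whitespace."""
    return reply.strip() in legal_moves

moves = ["1", "2", "3"]
print(build_prompt("pile of 5 stones", legal_moves=moves))
print(is_valid_reply(" 2 ", moves))
```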

Multimodal Inputs Offer Benefits: For puzzles that included visual information (text-image format), most models showed improved performance when incorporating these visual inputs. This highlights the value of image-based representations in strategic puzzle-solving, though weaker models sometimes struggled to effectively utilize this information.

Scaling and Error Analysis: For reasoning models, the amount of “thinking” (measured by generated tokens) correlated more strongly with performance than it did for non-reasoning models. In instruction-based settings, reasoning models also made fewer errors. In code-based settings, however, syntax errors and runtime errors became more prevalent, even for reasoning models.
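One common way to quantify such a relationship is the Pearson correlation coefficient between tokens generated and score. The article does not say which statistic the study used, so treat the choice of Pearson as an assumption; the numbers in the sketch are placeholders made up for illustration and are not results from the paper.

```python
import math
from typing import Sequence

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder values invented for illustration; NOT data from the study.
tokens_generated = [200, 400, 800, 1600]
puzzle_scores = [0.20, 0.35, 0.50, 0.90]
print(round(pearson(tokens_generated, puzzle_scores), 3))
```

A value near 1 would mean that longer reasoning traces go with higher scores; a value near 0 would mean extra tokens are not buying better play.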

Looking Ahead

PUZZLEPLEX provides a robust new tool for evaluating and guiding the development of foundation models. By exposing their strengths and weaknesses in reasoning, planning, and generalization across diverse and novel puzzle types, this benchmark will help researchers push the boundaries of AI capabilities, especially in areas requiring complex, multi-step problem-solving.
