TLDR: PUZZLEPLEX is a new benchmark with 15 novel puzzles designed to evaluate foundation models’ reasoning and planning. It tests models in instruction-based (natural language) and code-based (generating executable code) settings, across single/two-player, deterministic/stochastic, and text/text-image formats. Findings show that reasoning models excel in instruction-based tasks, that open-source models are competitive, and that code-based tasks are harder for models, though promising for efficiency. The benchmark highlights current AI limitations in multi-hop reasoning and program synthesis, guiding future AI development.
Foundation models, the powerful AI systems behind many recent breakthroughs, have shown incredible progress in understanding and generating human language. However, a deeper question remains: how well can these models truly reason and plan, especially in complex situations that demand sustained, structured thinking?
To answer this, researchers from New York University, Zhejiang University, Yale University, the University at Buffalo (SUNY), and NYU Grossman School of Medicine have introduced a new benchmark called PUZZLEPLEX. The platform is designed to rigorously test the reasoning and planning abilities of these advanced AI models using a diverse collection of puzzles. Full details are available in the research paper.
Introducing PUZZLEPLEX
Unlike previous benchmarks that often reuse well-known puzzles, PUZZLEPLEX features 15 entirely new, carefully selected puzzles. This makes it much less likely that the models have simply memorized solutions from their training data. The puzzles cover a wide range of scenarios, including:
- Single-player and two-player games.
- Deterministic environments (where outcomes are predictable) and stochastic environments (where chance plays a role).
- Puzzles presented as text-only or a combination of text and images.
The framework is also designed to be adaptable, allowing for the creation of even more challenging puzzle instances as AI models continue to evolve. To provide a baseline for comparison, the researchers also implemented specialized game-playing strategies for each puzzle.
How Models Were Evaluated
The study assessed foundation models under two distinct evaluation protocols:
Instruction-based Evaluation: In this setting, models interact with the puzzles using natural language, much like a human player would. They receive instructions and provide their moves or decisions in text format.
Code-based Evaluation: Here, the models are tasked with generating executable code that can solve the puzzles autonomously. This approach tests not only their reasoning but also their ability to synthesize correct and functional programs.
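To make the difference between the two protocols concrete, here is a minimal sketch on a toy puzzle. Everything in it is an assumption made for illustration: the counter-removal puzzle, the function names (`play_instruction_based`, `play_code_based`, `choose_move`), and the stand-in "models" are hypothetical and are not taken from PUZZLEPLEX. The point is only the shape of the loop: in the instruction-based setting the model is queried on every turn, while in the code-based setting it is queried once for a program that then plays on its own.

```python
from typing import Callable

# Toy single-player puzzle (hypothetical, not from PUZZLEPLEX):
# start with some counters, remove 1-3 per turn, finish exactly at 0.

def legal_moves(state: int) -> list[int]:
    """Moves that do not remove more counters than remain."""
    return [m for m in (1, 2, 3) if m <= state]

def play_instruction_based(ask_model: Callable[[str], str], start: int) -> bool:
    """Query the model in natural language on every turn."""
    state = start
    while state > 0:
        prompt = f"There are {state} counters. Reply with how many to remove (1-3)."
        reply = ask_model(prompt)
        try:
            move = int(reply.strip())
        except ValueError:
            return False  # an unparseable reply counts as a failure
        if move not in legal_moves(state):
            return False  # an illegal move counts as a failure
        state -= move
    return True

def play_code_based(generated_code: str, start: int) -> bool:
    """Query the model once for a program, then let the program play alone."""
    namespace: dict = {}
    try:
        # NOTE: exec on untrusted code is unsafe; a real harness would sandbox it.
        exec(generated_code, namespace)  # syntax errors surface here
        policy = namespace["choose_move"]
    except Exception:
        return False
    state = start
    while state > 0:
        try:
            move = policy(state)
        except Exception:
            return False  # runtime error inside the generated program
        if move not in legal_moves(state):
            return False
        state -= move
    return True

# Stand-ins for a real model (assumptions for this sketch).
def fake_model(prompt: str) -> str:
    return "1"

FAKE_PROGRAM = "def choose_move(state):\n    return min(3, state)\n"

print(play_instruction_based(fake_model, 5))  # True
print(play_code_based(FAKE_PROGRAM, 7))       # True
```

In this sketch the code-based path has two extra ways to fail (the program may not compile, or it may crash while running), which mirrors the kinds of errors the study reports for that setting.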
Key Findings from the Benchmark
The results offer valuable insights into the current capabilities and limitations of foundation models:
Reasoning Models Excel in Instruction-based Settings: Models specifically designed for reasoning, such as DeepSeek-R1, consistently outperformed non-reasoning models when interacting through natural language instructions. This suggests that allowing models to “think” more deeply during the task (known as test-time scaling) significantly improves their performance.
Open-Source Models are Catching Up: A notable finding was the strong performance of open-source models. DeepSeek-R1, for example, achieved the highest normalized score in the instruction-based setting, even surpassing some proprietary models like Gemini-2.5-pro. This indicates rapid progress in the open-source AI community.
Code-based Evaluation Poses Greater Challenges: While promising for efficiency, the code-based setting proved more difficult for models. Generating accurate and executable code requires a different set of skills, and performance dropped noticeably compared to instruction-based interactions. However, the study also found that generating multiple code samples and picking the best one could significantly improve performance.
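The "generate several programs and keep the best" idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the scoring rule (fraction of toy instances solved), the `solve(n)` interface, and the hard-coded candidate programs are all hypothetical, and the article does not say how the best sample was chosen in the study.

```python
from typing import Callable, Sequence

def score_program(code: str, instances: Sequence[int] = (4, 7, 10)) -> float:
    """Fraction of toy instances solved; programs that fail to load score 0.

    Toy task (hypothetical): solve(n) must return a list of moves,
    each in {1, 2, 3}, that sum exactly to n.
    """
    namespace: dict = {}
    try:
        # NOTE: exec on untrusted code is unsafe; a real harness would sandbox it.
        exec(code, namespace)
        solve = namespace["solve"]
    except Exception:
        return 0.0
    solved = 0
    for n in instances:
        try:
            moves = solve(n)
            if sum(moves) == n and all(m in (1, 2, 3) for m in moves):
                solved += 1
        except Exception:
            pass  # a runtime error just means this instance is not solved
    return solved / len(instances)

def best_of_n(candidates: Sequence[str],
              score: Callable[[str], float]) -> tuple[str, float]:
    """Return the highest-scoring candidate (earliest one wins ties)."""
    scored = [(score(code), -i, code) for i, code in enumerate(candidates)]
    best_score, _, best_code = max(scored, key=lambda t: (t[0], t[1]))
    return best_code, best_score

# Stand-ins for sampled model outputs (assumptions for this sketch).
CANDIDATES = [
    "def solve(n) return []",                     # does not compile
    "def solve(n):\n    return [2] * (n // 2)\n",  # fails on odd n
    "def solve(n):\n    return [1] * n\n",         # always valid
]

best_code, best_score = best_of_n(CANDIDATES, score_program)
print(best_score)  # 1.0
```

A design note on the sketch: selection here uses the puzzle's own outcome as the score, so broken samples are filtered out automatically because they score zero.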
Prompting Strategies Have Mixed Results: Advanced prompting techniques, like Chain-of-Thought (CoT) or Tree-of-Thought (ToT), did not help consistently. Interestingly, for some puzzles, removing the model’s past reasoning history actually led to better results, suggesting that current models can sometimes be misled by their own previous “thoughts” in multi-step reasoning tasks. However, providing models with a list of legal moves consistently boosted performance, as it helped them avoid invalid moves.
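The two prompt variations described above (dropping past reasoning history, and listing legal moves) can be pictured as switches on a prompt builder. The sketch below is hypothetical: the function name `build_prompt`, the section labels, and the move notation are assumptions for illustration and do not reflect the benchmark's actual prompt format.

```python
from typing import Optional, Sequence

def build_prompt(state_text: str,
                 legal_moves: Optional[Sequence[str]] = None,
                 history: Optional[Sequence[str]] = None) -> str:
    """Assemble a turn prompt; history and legal moves are optional sections."""
    parts = []
    if history:
        # Earlier turns (including the model's own reasoning) are included
        # only when the caller passes them in.
        parts.append("Previous turns:\n" + "\n".join(history))
    parts.append("Current state:\n" + state_text)
    if legal_moves is not None:
        # Listing the legal moves lets the model pick from valid options.
        parts.append("Legal moves: " + ", ".join(legal_moves))
    parts.append("Reply with exactly one move.")
    return "\n\n".join(parts)

# Variant with legal moves and no history.
print(build_prompt("Board: . X . / O . .", legal_moves=["a1", "c1", "b2"]))
```

Calling `build_prompt` with `history=None` corresponds to the "no past reasoning" condition, and passing `legal_moves` corresponds to the condition that consistently helped.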
Multimodal Inputs Offer Benefits: For puzzles that included visual information (text-image format), most models showed improved performance when incorporating these visual inputs. This highlights the value of image-based representations in strategic puzzle-solving, though weaker models sometimes struggled to effectively utilize this information.
Scaling and Error Analysis: For reasoning models, the amount of “thinking” (measured by generated tokens) correlated more strongly with performance than it did for non-reasoning models. In instruction-based settings, reasoning models also made fewer errors. In code-based settings, however, syntax errors and runtime errors became more prevalent, even for reasoning models.
Looking Ahead
PUZZLEPLEX provides a robust new tool for evaluating and guiding the development of foundation models. By exposing their strengths and weaknesses in reasoning, planning, and generalization across diverse and novel puzzle types, this benchmark will help researchers push the boundaries of AI capabilities, especially in areas requiring complex, multi-step problem-solving.


