
Enhancing Language Models’ Reasoning Through Structured Planning: The CRISP Dataset

TLDR: CRISP is a new dataset for training large language models (LLMs) to generate high-quality, step-based plans for complex problems in mathematics and code generation. The research shows that fine-tuning even small LLMs on CRISP markedly improves their planning abilities, yielding better performance on reasoning tasks than larger models using traditional methods such as Chain-of-Thought, and that the learned planning skills transfer strongly across domains.

Large language models (LLMs) have made significant strides in areas like logical reasoning, code generation, and mathematical problem-solving. A key method behind these advancements is Chain-of-Thought (CoT) prompting, which helps LLMs break down complex tasks into manageable steps. However, CoT still has limitations, often leading to errors like missing intermediate steps or semantic misunderstandings.

A promising alternative involves explicit high-level plan generation, where an LLM first creates a structured plan before attempting to solve a problem. While this “plan-and-solve” approach has shown improvements, existing methods often assume that LLMs can generate effective plans through simple prompting without additional training. Researchers at IBM Research, including Matan Vetzler, Koren Lazar, Guy Uziel, Eran Hirsch, Ateret Anaby-Tavor, and Leshem Choshen, challenged this assumption with their work on CRISP.
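The "plan-and-solve" pattern described above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: `llm` is a hypothetical stand-in for any chat-model call, stubbed here with canned responses so the example runs self-contained.

```python
def llm(prompt: str) -> str:
    # Placeholder for a real model call; returns canned text so the
    # sketch is runnable without any API.
    if prompt.startswith("Plan:"):
        return "1. Parse the input. 2. Apply the formula. 3. Return the result."
    return "42"

def plan_and_solve(problem: str) -> tuple[str, str]:
    # Step 1: ask the model for a high-level plan only.
    plan = llm(f"Plan: outline the steps to solve: {problem}")
    # Step 2: ask it to solve the problem while following that plan.
    answer = llm(f"Follow this plan to solve {problem!r}:\n{plan}")
    return plan, answer

plan, answer = plan_and_solve("What is 6 * 7?")
```

The key design point is the separation of the two calls: the plan is produced before, and independently of, the final answer, which is what lets a dataset like CRISP target the planning step on its own.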

Introducing CRISP: A Dataset for Better Planning

The new research introduces CRISP (Complex Reasoning with Interpretable Step-based Plans), a novel multi-domain dataset designed to enhance the high-level planning capabilities of LLMs. CRISP focuses on two key domains: mathematical reasoning and code generation, where solutions naturally break down into structured, high-level steps. The dataset was built using annotated detailed solutions from Magpie-Reasoning-V1-150K, a large dataset of reasoning examples.

The plans within CRISP are not just generated; they undergo a rigorous two-step validation process. First, an LLM acts as a judge to intrinsically validate the plans for clarity, conciseness, coherence, and completeness. Plans that fail any of these criteria are discarded. Second, an extrinsic validation step assesses the plan’s actual impact on downstream task performance. Plans are only retained if they lead to more correct answers when used by an LLM to solve the original problem, compared to solving it without a high-level plan.
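The two-step filter above can be sketched as follows. This is an illustrative approximation under stated assumptions: in the actual pipeline both `judge` and `solve` are LLM calls, whereas here they are deterministic stubs so the example is self-contained.

```python
# Intrinsic criteria the LLM-as-judge checks each plan against.
CRITERIA = ("clarity", "conciseness", "coherence", "completeness")

def judge(plan: str, criterion: str) -> bool:
    # Stub intrinsic judge: a real system would prompt an LLM here.
    return bool(plan.strip())

def solve(problem: str, plan: str = "") -> str:
    # Stub solver: pretends the plan helps on this toy problem.
    return "42" if plan else "unknown"

def keep_plan(problem: str, plan: str, gold: str) -> bool:
    # Step 1 (intrinsic): discard plans failing any criterion.
    if not all(judge(plan, c) for c in CRITERIA):
        return False
    # Step 2 (extrinsic): retain only plans that improve the
    # downstream answer relative to solving without a plan.
    return solve(problem, plan) == gold and solve(problem) != gold

print(keep_plan("What is 6 * 7?", "1. Multiply 6 by 7.", "42"))  # True
print(keep_plan("What is 6 * 7?", "   ", "42"))                  # False
```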

Key Findings and Impact

The experiments conducted with CRISP yielded several significant findings:

  • Superior Plan Generation: The research demonstrates that fine-tuning a relatively small model on CRISP enables it to generate higher-quality plans than much larger, off-the-shelf models using only few-shot prompting. This highlights that high-level plan generation is a learned capability that can be significantly improved through targeted training.

  • Enhanced Performance: When these high-quality plans are used, LLMs significantly outperform traditional Chain-of-Thought reasoning across various benchmarks, including MBPP and HumanEval for code generation, and GSM8K and MATH for mathematical problem-solving. Error reductions reached up to 28% in some cases.

  • Quality Over Quantity: Intriguingly, the fine-tuned models generated plans that were often shorter yet more coherent and complete, suggesting that a few well-structured steps are more impactful than many less refined ones.

  • Domain Generalizability: One of the most compelling findings is the strong transferability of planning capabilities across domains. A model fine-tuned on the Math domain, for instance, showed impressive performance on code generation tasks, nearly matching models specifically trained on coding. This suggests that the abstract reasoning and general problem-solving strategies learned from one domain can effectively transfer to others, enhancing versatility.

The study concludes that explicit fine-tuning on high-level planning, as facilitated by the CRISP dataset, significantly enhances an LLM’s ability to decompose tasks. This improvement makes LLMs more robust and applicable to real-world scenarios requiring complex, domain-agnostic reasoning. The CRISP dataset is publicly available, encouraging further research into explicit planning mechanisms and structured reasoning in natural language processing. You can read the full research paper here: CRISP: Complex Reasoning with Interpretable Step-based Plans.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
