
Enhancing Language Models’ Reasoning Through Structured Planning: The CRISP Dataset

TLDR: CRISP is a new dataset for training large language models (LLMs) to generate high-quality, step-based plans for complex problems in mathematics and code generation. The research shows that fine-tuning even small LLMs on CRISP markedly improves their planning abilities, yielding better performance on reasoning tasks than larger models using traditional methods such as Chain-of-Thought, and that the learned planning skills transfer strongly across domains.

Large language models (LLMs) have made significant strides in areas like logical reasoning, code generation, and mathematical problem-solving. A key method behind these advancements is Chain-of-Thought (CoT) prompting, which helps LLMs break down complex tasks into manageable steps. However, CoT still has limitations, often leading to errors like missing intermediate steps or semantic misunderstandings.

A promising alternative involves explicit high-level plan generation, where an LLM first creates a structured plan before attempting to solve a problem. While this “plan-and-solve” approach has shown improvements, existing methods often assume that LLMs can generate effective plans through simple prompting without additional training. Researchers at IBM Research, including Matan Vetzler, Koren Lazar, Guy Uziel, Eran Hirsch, Ateret Anaby-Tavor, and Leshem Choshen, challenged this assumption with their work on CRISP.
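The "plan-and-solve" pattern described above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: `llm` is a hypothetical stand-in for any chat-model call, stubbed here with canned responses so the example runs self-contained.

```python
def llm(prompt: str) -> str:
    # Placeholder for a real model call; returns canned text so the
    # sketch is runnable without any API.
    if prompt.startswith("Plan:"):
        return "1. Parse the input. 2. Apply the formula. 3. Return the result."
    return "42"

def plan_and_solve(problem: str) -> tuple[str, str]:
    # Step 1: ask the model for a high-level plan only.
    plan = llm(f"Plan: outline the steps to solve: {problem}")
    # Step 2: ask it to solve the problem while following that plan.
    answer = llm(f"Follow this plan to solve {problem!r}:\n{plan}")
    return plan, answer

plan, answer = plan_and_solve("What is 6 * 7?")
```

The key design point is the separation of the two calls: the plan is produced before, and independently of, the final answer, which is what lets a dataset like CRISP target the planning step on its own.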

Introducing CRISP: A Dataset for Better Planning

The new research introduces CRISP (Complex Reasoning with Interpretable Step-based Plans), a novel multi-domain dataset designed to enhance the high-level planning capabilities of LLMs. CRISP focuses on two key domains: mathematical reasoning and code generation, where solutions naturally break down into structured, high-level steps. The dataset was built using annotated detailed solutions from Magpie-Reasoning-V1-150K, a large dataset of reasoning examples.

The plans within CRISP are not just generated; they undergo a rigorous two-step validation process. First, an LLM acts as a judge to intrinsically validate the plans for clarity, conciseness, coherence, and completeness. Plans that fail any of these criteria are discarded. Second, an extrinsic validation step assesses the plan’s actual impact on downstream task performance. Plans are only retained if they lead to more correct answers when used by an LLM to solve the original problem, compared to solving it without a high-level plan.
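The two-step filter above can be sketched as follows. This is an illustrative approximation under stated assumptions: in the actual pipeline both `judge` and `solve` are LLM calls, whereas here they are deterministic stubs so the example is self-contained.

```python
# Intrinsic criteria the LLM-as-judge checks each plan against.
CRITERIA = ("clarity", "conciseness", "coherence", "completeness")

def judge(plan: str, criterion: str) -> bool:
    # Stub intrinsic judge: a real system would prompt an LLM here.
    return bool(plan.strip())

def solve(problem: str, plan: str = "") -> str:
    # Stub solver: pretends the plan helps on this toy problem.
    return "42" if plan else "unknown"

def keep_plan(problem: str, plan: str, gold: str) -> bool:
    # Step 1 (intrinsic): discard plans failing any criterion.
    if not all(judge(plan, c) for c in CRITERIA):
        return False
    # Step 2 (extrinsic): retain only plans that improve the
    # downstream answer relative to solving without a plan.
    return solve(problem, plan) == gold and solve(problem) != gold

print(keep_plan("What is 6 * 7?", "1. Multiply 6 by 7.", "42"))  # True
print(keep_plan("What is 6 * 7?", "   ", "42"))                  # False
```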

Key Findings and Impact

The experiments conducted with CRISP yielded several significant findings:

  • Superior Plan Generation: The research demonstrates that fine-tuning a relatively small model on CRISP enables it to generate higher-quality plans than much larger, off-the-shelf models using only few-shot prompting. This highlights that high-level plan generation is a learned capability that can be significantly improved through targeted training.

  • Enhanced Performance: When these high-quality plans are used, LLMs significantly outperform traditional Chain-of-Thought reasoning across various benchmarks, including MBPP and HumanEval for code generation, and GSM8K and MATH for mathematical problem-solving. Error reductions reached up to 28% in some cases.

  • Quality Over Quantity: Intriguingly, the fine-tuned models generated plans that were often shorter yet more coherent and complete, suggesting that a few well-structured steps are more impactful than many less refined ones.

  • Domain Generalizability: One of the most compelling findings is the strong transferability of planning capabilities across domains. A model fine-tuned on the Math domain, for instance, showed impressive performance on code generation tasks, nearly matching models specifically trained on coding. This suggests that the abstract reasoning and general problem-solving strategies learned from one domain can effectively transfer to others, enhancing versatility.

The study concludes that explicit fine-tuning on high-level planning, as facilitated by the CRISP dataset, significantly enhances an LLM’s ability to decompose tasks. This improvement makes LLMs more robust and applicable to real-world scenarios requiring complex, domain-agnostic reasoning. The CRISP dataset is publicly available, encouraging further research into explicit planning mechanisms and structured reasoning in natural language processing. You can read the full research paper here: CRISP: Complex Reasoning with Interpretable Step-based Plans.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
