TLDR: A new method called Per-Instance Program Synthesis (PIPS) improves Large Language Models’ (LLMs) ability to perform complex, multi-step reasoning. PIPS dynamically chooses between direct inference and program synthesis, iteratively refines generated programs using structural feedback, and extracts symbolic input from unstructured data. This approach significantly boosts accuracy and reduces undesirable program outputs across 30 diverse benchmarks, making LLM reasoning more reliable and efficient.
Large Language Models (LLMs) have shown remarkable capabilities in understanding and generating human-like text, excelling in many zero-shot inference tasks. However, when faced with complex problems requiring multiple steps of reasoning, especially in algorithmic domains, these models often hit a wall. Traditional methods like Chain of Thought (CoT) and Program of Thought (PoT) have attempted to guide LLMs through intermediate reasoning steps, but they frequently produce less-than-ideal solutions, including programs that are trivial, contain errors, or hardcode answers.
A new research paper, “Once Upon an Input: Reasoning via Per-Instance Program Synthesis”, introduces an innovative approach called Per-Instance Program Synthesis (PIPS). Developed by Adam Stein, Neelay Velingker, Mayur Naik, and Eric Wong from the University of Pennsylvania, PIPS aims to tackle these challenges by generating and refining programs at the individual instance level, using structural feedback without needing explicit task-specific guidance or test cases.
Understanding the Core Challenges
The researchers identified three main hurdles in instance-level program synthesis for LLMs:
1. Open Domain Nature: It’s often unclear whether a problem instance is best solved by generating a program or by direct, natural language reasoning (like CoT). Applying program synthesis to non-algorithmic problems can lead to inefficient or trivial code.
2. Lack of Task Specifications: Unlike traditional programming, LLMs generating code for reasoning problems often lack clear specifications for what a ‘correct’ program should look like. This can result in programs that are syntactically correct but functionally trivial or incorrect.
3. Unstructured Input: Programs typically operate on structured data, but many real-world reasoning problems involve unstructured inputs like natural language text or images. LLMs often struggle to bridge this gap, sometimes attempting to process raw data within the generated code itself, leading to errors.
How PIPS Addresses These Issues
PIPS introduces several key mechanisms to overcome these challenges:
Selective Program Synthesis: PIPS incorporates a confidence metric that dynamically decides, for each individual problem instance, whether direct inference (CoT) or program synthesis is the more suitable approach. This prevents the LLM from unnecessarily generating programs for problems that are better handled by direct reasoning, optimizing efficiency and performance.
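As a rough illustration of the per-instance switch, the sketch below combines a few self-assessed signals about a problem into one confidence score and compares it to a threshold. The signal names, weights, bias, and threshold are all hypothetical, and a logistic score is only one plausible way to build such a metric; this is not the metric defined in the paper.

```python
import math

def synthesis_confidence(signals, weights, bias=0.0):
    """Logistic score in (0, 1); higher means program synthesis looks more suitable.

    `signals` and `weights` are hypothetical dicts keyed by signal name.
    """
    z = bias + sum(weights[name] * value for name, value in signals.items())
    return 1.0 / (1.0 + math.exp(-z))

def choose_mode(signals, weights, bias=0.0, threshold=0.5):
    """Pick a reasoning mode for a single problem instance."""
    score = synthesis_confidence(signals, weights, bias)
    return "program_synthesis" if score >= threshold else "cot"

# Hypothetical signals, e.g. answers an LLM gave about the instance (0 = no, 1 = yes).
weights = {"needs_arithmetic": 2.0, "needs_search": 2.0}
algorithmic = {"needs_arithmetic": 1.0, "needs_search": 1.0}
open_ended = {"needs_arithmetic": 0.0, "needs_search": 0.0}

print(choose_mode(algorithmic, weights, bias=-1.0))
print(choose_mode(open_ended, weights, bias=-1.0))
```

With these made-up numbers, the algorithmic instance is routed to program synthesis and the open-ended one falls back to CoT. In practice the weights would have to be fitted or tuned on held-out data.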
Iterative Program Refinement with Structural Feedback: To address the lack of explicit task specifications, PIPS employs an iterative process. It generates a program, evaluates it with structural checks (e.g., for syntax errors, type errors, or trivial solutions), and then uses this feedback to refine the program. The loop steers the generated code toward being well-formed and non-trivial, without requiring human-provided test cases.
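The refinement loop can be sketched with Python's standard `ast` module. The checks below (syntax, unused input, hardcoded return value) are illustrative stand-ins for the kinds of structural checks described above, not the paper's actual evaluator. The entry-point name `solve` and the `generate` callable, which stands in for an LLM call that receives the previous feedback, are assumptions.

```python
import ast

def structural_feedback(source, entry="solve"):
    """Return a list of issues found by inspecting the program without running it."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg} (line {err.lineno})"]

    funcs = [n for n in tree.body
             if isinstance(n, ast.FunctionDef) and n.name == entry]
    if not funcs:
        return [f"missing function {entry!r}"]
    func = funcs[0]

    issues = []
    params = {a.arg for a in func.args.args}
    used = {n.id for n in ast.walk(func) if isinstance(n, ast.Name)}
    if params and not (params & used):
        issues.append("input is never used")

    returns = [n for n in ast.walk(func) if isinstance(n, ast.Return)]
    if not returns:
        issues.append("no return statement")
    elif all(isinstance(r.value, ast.Constant) for r in returns):
        issues.append("answer is hardcoded")
    return issues

def refine(generate, max_rounds=3):
    """Call `generate(feedback)` until the checks pass or the rounds run out."""
    feedback, source = [], ""
    for _ in range(max_rounds):
        source = generate(feedback)
        feedback = structural_feedback(source)
        if not feedback:
            break
    return source, feedback
```

A program such as `def solve(symbols): return 42` is flagged both for ignoring its input and for hardcoding the answer, and that feedback is what the next generation round would see.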
Instance-Specific Symbolic Extraction: For unstructured inputs, PIPS explicitly performs a symbolic extraction step. An LLM processes the raw input (e.g., an image or text) to identify relevant entities, attributes, and relationships, converting them into a structured, symbolic format (like JSON) before program synthesis begins. This decouples the perceptual understanding from the algorithmic reasoning, making the program generation more robust.
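To make the separation concrete, here is a minimal sketch in which extraction and computation are two distinct steps. The `llm` callable, the prompt wording, the JSON schema, and the `fake_llm` stub are all hypothetical; a real system would call an actual model and use a schema suited to the task.

```python
import json

def extract_symbols(raw_input, llm):
    """Ask an LLM (any callable: prompt -> str) for a JSON description of the input."""
    reply = llm("Extract the entities and quantities as a JSON object:\n" + raw_input)
    symbols = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    if not isinstance(symbols, dict):
        raise ValueError("expected a JSON object")
    return symbols

def solve(symbols):
    """The synthesized program only ever sees structured data, never raw text."""
    return sum(item["count"] for item in symbols["items"])

def fake_llm(prompt):
    """Stub standing in for a real model call."""
    return '{"items": [{"name": "apple", "count": 3}, {"name": "pear", "count": 4}]}'

symbols = extract_symbols("Ann has 3 apples and 4 pears.", fake_llm)
print(solve(symbols))  # 7
```

Because `solve` takes the extracted symbols as an argument, the program cannot quietly re-parse the raw text or embed the answer in its own body without that being visible in the code.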
Impressive Results Across Diverse Tasks
Experiments across three frontier LLMs (Gemini-2.0-Flash, GPT-4.1-mini, and o4-mini) and 30 benchmarks, including tasks from Big Bench Extra Hard (BBEH), visual question answering, relational reasoning, and mathematical reasoning, show the following:
- PIPS improved harmonic mean accuracy by up to 8.6% absolute compared to PoT and 9.4% absolute compared to CoT.
- It reduced undesirable program generations by 65.1% on algorithmic tasks compared to PoT with Gemini-2.0-Flash.
- The confidence metric successfully selected the correct reasoning method (CoT or program synthesis) in 65% of cases, leading to improved performance on both algorithmic and non-algorithmic tasks.
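For readers unfamiliar with the headline metric: the harmonic mean of per-benchmark accuracies is pulled down sharply by any benchmark on which a method does poorly, so it rewards consistency across tasks. The sketch below uses made-up accuracies purely for illustration; they are not results from the paper.

```python
def harmonic_mean(accuracies):
    """Harmonic mean of accuracies in (0, 1]; defined here as 0.0 if any accuracy is 0."""
    if any(a <= 0 for a in accuracies):
        return 0.0
    return len(accuracies) / sum(1.0 / a for a in accuracies)

# Illustrative numbers only: two strong benchmarks and one weak one.
scores = [0.9, 0.9, 0.1]
arithmetic = sum(scores) / len(scores)
print(f"arithmetic mean: {arithmetic:.3f}")
print(f"harmonic mean:   {harmonic_mean(scores):.3f}")
```

The harmonic mean comes out well below the arithmetic mean here, because the single weak benchmark dominates the sum of reciprocals.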
The research highlights that PIPS not only enhances the accuracy of LLMs in complex reasoning tasks but also significantly improves the quality and utility of the generated code. By focusing on instance-level synthesis and leveraging structural feedback, PIPS offers a promising path towards more reliable and interpretable AI reasoning systems.


