TL;DR: RAMP is a novel, lightweight, multi-agent framework for Automated Program Repair (APR) in Ruby. It uses collaborative agents for feedback-driven, iterative bug fixing: generating tests, reflecting on errors, and refining candidate fixes, without relying on large multilingual repair databases or costly fine-tuning. RAMP achieves 67% pass@1 on the Ruby portion of the XCodeEval benchmark, outperforming existing methods and converging quickly. It is particularly effective on ‘wrong answer’, ‘compilation’, and ‘runtime’ errors, establishing a new foundation for LLM-based debugging in under-studied languages.
Software development often involves the time-consuming and error-prone task of finding and fixing bugs. While traditional Automated Program Repair (APR) methods exist, the rise of Large Language Models (LLMs) has opened new avenues for more flexible and context-aware solutions. However, many LLM-based APR approaches are computationally expensive, require extensive fine-tuning, or focus on a limited set of programming languages, often overlooking languages like Ruby.
Ruby, despite its widespread use in web development and the persistent debugging challenges faced by its developers, has received little attention in APR research. Addressing this gap, a new framework called RAMP (Ruby Automated Multi-agent Program repair) has been introduced. RAMP is a lightweight, feedback-driven system that treats program repair as an iterative process tailored specifically to Ruby.
What is RAMP and How Does it Work?
RAMP distinguishes itself by avoiding reliance on large multilingual repair databases or costly fine-tuning. Instead, it operates directly on Ruby code using lightweight prompting and test-driven feedback. The framework employs a team of collaborative agents, each with a specialized role, to generate targeted tests, reflect on errors, and refine candidate fixes until a correct solution is found. This multi-agent workflow allows for deeper semantic reasoning while remaining cost-efficient.
The core of RAMP’s methodology involves an iterative loop coordinated by four specialized agents:
- Feedback Integrator Agent: This agent initiates the process by hypothesizing the potential cause of a bug in natural language. It also updates this reflection based on execution traces and error logs during subsequent iterations, guiding the repair process.
- Test Designer Agent: Responsible for generating a compact yet diverse set of guiding test cases (basic, edge, and large-scale inputs). These tests are crucial for evaluating candidate repairs and providing feedback without the computational expense of running a large benchmark suite.
- Programmer Agent: This agent generates candidate repair programs. It receives the problem context, buggy code, and prior reflections, and is prompted to reason about the bug before proposing a fix. It iteratively refines solutions based on feedback.
- Test Executor Agent: A non-LLM component, this Python script executes the candidate Ruby code against the generated test cases. It captures outputs, exceptions, and exit statuses, providing verdicts and traces that inform the Feedback Integrator Agent.
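To make the executor's role concrete, here is a minimal sketch of what such a component might look like: a script that runs a candidate Ruby program against the generated test cases using a system Ruby interpreter and records verdicts and traces. The function name, test-case format, and timeout value are illustrative assumptions, not details taken from the paper.

```python
import subprocess

def run_candidate(ruby_source: str, test_cases: list[dict], timeout: float = 5.0) -> list[dict]:
    """Run a candidate Ruby program against generated test cases.

    Each test case is assumed to be a dict with 'input' (stdin text) and
    'expected' (expected stdout text). Returns one verdict per test.
    """
    verdicts = []
    for case in test_cases:
        try:
            proc = subprocess.run(
                ["ruby", "-e", ruby_source],   # execute the candidate with the system Ruby
                input=case["input"],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            passed = proc.returncode == 0 and proc.stdout.strip() == case["expected"].strip()
            verdicts.append({
                "passed": passed,
                "stdout": proc.stdout,
                "stderr": proc.stderr,         # error trace fed back to the Feedback Integrator
                "exit_status": proc.returncode,
            })
        except subprocess.TimeoutExpired:
            verdicts.append({"passed": False, "stdout": "", "stderr": "TIME_LIMIT_EXCEEDED", "exit_status": None})
    return verdicts
```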
The process continues until a candidate repair passes all generated tests or an iteration budget is exhausted. Only then is the solution validated against hidden benchmark tests.
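Putting the pieces together, the iteration loop could be sketched roughly as follows. The reflect, design_tests, and propose_fix functions stand in for the LLM-backed agents and are hypothetical placeholders rather than the paper's actual API; run_candidate is the executor sketched above.

```python
def repair(problem: str, buggy_code: str, max_iterations: int = 5) -> str:
    """Conceptual repair loop: reflect -> test -> propose fix -> execute -> repeat."""
    reflection = reflect(problem, buggy_code, feedback=None)   # Feedback Integrator: initial bug hypothesis
    tests = design_tests(problem, buggy_code)                  # Test Designer: basic, edge, large-scale cases
    candidate = buggy_code

    for _ in range(max_iterations):
        candidate = propose_fix(problem, candidate, reflection)  # Programmer: reason about the bug, emit a fix
        verdicts = run_candidate(candidate, tests)               # Test Executor (see sketch above)
        if all(v["passed"] for v in verdicts):
            return candidate                                     # only now checked against hidden benchmark tests
        reflection = reflect(problem, candidate, feedback=verdicts)  # update hypothesis from traces and error logs

    return candidate  # best effort once the iteration budget is exhausted
```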
Performance and Key Insights
Evaluated on the XCodeEval benchmark, RAMP achieved a pass@1 score of 67% on Ruby, significantly outperforming prior approaches like LANTERN (61.7%) and other prompting baselines. A notable aspect of RAMP is its rapid convergence, often finding solutions within five iterations. Ablation studies confirmed that both test generation and self-reflection are critical drivers of its performance, especially for models like DeepSeekCoder.
RAMP proved particularly effective at repairing programs that initially produced ‘WRONG_ANSWER’ (68.5% repaired), ‘COMPILATION_ERROR’ (66.7% repaired), and ‘RUNTIME_ERROR’ (60.4% repaired). However, it struggled more with resource-related failures like ‘TIME_LIMIT_EXCEEDED’.
When analyzing performance across different problem categories, RAMP achieved perfect success on problems tagged with ‘geometry’ and ‘strings’. It also showed strong performance on ‘brute force’, ‘dynamic programming (dp)’, ‘math’, ‘games’, and ‘graphs’. Conversely, it faced challenges with advanced or niche categories such as ‘binary search’, ‘bitmasks’, ‘matrices’, and ‘graph matchings’, which often require highly precise reasoning and domain-specific knowledge.
Practicality and Future Directions
The framework demonstrates a strong balance between accuracy and computational efficiency, offering a practical solution for Ruby APR. Its design also allows relatively easy adaptation to other programming languages by swapping the test executor and updating the few-shot examples; RAMP has already shown promising results on C++, for instance.
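As a rough illustration of what such a swap could involve (the mapping below is hypothetical, not taken from the paper), the main language-specific piece is the command the executor uses to run a candidate program:

```python
# Hypothetical illustration: adapting the executor to another language largely
# means changing how a candidate source string is run (or compiled and run).
RUN_COMMANDS = {
    "ruby":   lambda src: ["ruby", "-e", src],
    "python": lambda src: ["python3", "-c", src],
    # A compiled language such as C++ would add a compile step
    # (e.g. invoking a compiler on a temp file) before the run command.
}
```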
The introduction of RAMP provides new insights into multi-agent repair strategies and lays a foundation for extending LLM-based debugging tools to under-studied languages. Future research aims to further enhance domain-specific reasoning and improve the reliability of the generated tests to strengthen RAMP’s iterative repair loop. For more details, you can read the full research paper here.


