
CodeAgents: Boosting LLM Agent Performance and Efficiency with Codified Reasoning

TLDR: CodeAgents is a new framework that improves Large Language Model (LLM) agents by codifying their interactions and reasoning into structured pseudocode. This approach significantly enhances planning capabilities, reduces token usage by 55–87%, and improves task accuracy by 3–36 percentage points across benchmarks like GAIA, HotpotQA, and VirtualHome, making LLM-driven multi-agent systems more scalable and interpretable.

Large Language Models (LLMs) are becoming increasingly powerful in driving AI agents, helping them plan and execute complex tasks. However, current methods often face challenges like excessive verbosity, high token usage, and limitations in multi-agent scenarios. These issues can make LLM-driven agents less efficient and harder to manage.

To address these limitations, researchers have introduced CodeAgents, a novel framework designed to make multi-agent reasoning more structured and token-efficient. CodeAgents transforms the way LLM agents interact by codifying all aspects of their communication and planning into modular pseudocode. This includes tasks, plans, feedback, system roles, and even external tool invocations. By using pseudocode, which incorporates control structures like loops and conditionals, boolean logic, and typed variables, CodeAgents turns loosely connected agent plans into cohesive, interpretable, and verifiable reasoning programs.
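To make this concrete, here is a minimal sketch of the kind of codified plan the framework describes. The tool names and the task are invented for illustration, and the stub functions simply let the snippet run; they are not the framework's actual API:

```python
def walk_to(obj: str) -> bool:      # hypothetical stub: pretend navigation succeeds
    return True

def grab(obj: str) -> bool:         # hypothetical stub
    return True

def put_in(obj: str, container: str) -> bool:  # hypothetical stub
    return True

# Typed variables make the plan's state explicit.
target: str = "apple"
container: str = "fridge"
max_retries: int = 3
done: bool = False

# Control flow (loops, conditionals) turns the plan into a verifiable program.
for attempt in range(max_retries):
    if walk_to(target) and grab(target):
        done = walk_to(container) and put_in(target, container)
    if done:
        break

assert done, "plan failed after retries"  # the success condition is checkable
print("task completed:", done)
```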

How CodeAgents Works

The core idea behind CodeAgents is to treat a complex reasoning task like a program. Instead of relying on verbose natural language dialogues, the framework provides a pseudocode template that the LLM fills in and follows. This approach explicitly defines interactions between different agents, such as a Planner that outlines high-level plans, a Solver that executes detailed reasoning, and a Reviewer that provides feedback. Agents communicate clearly through well-defined variables, iterating as needed within a coherent prompt.
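A toy sketch of this fill-and-follow loop might look as follows; the role functions are illustrative stand-ins for LLM calls, not the paper's actual prompts:

```python
from typing import List

def planner(task: str) -> List[str]:
    # Stand-in for an LLM call that writes a high-level pseudocode plan.
    return [f"research {task}", f"summarize findings on {task}"]

def solver(step: str) -> str:
    # Stand-in for an LLM call that executes one step of the plan.
    return f"result of: {step}"

def reviewer(result: str) -> bool:
    # Stand-in for an LLM call that critiques a result; here it approves
    # anything non-empty.
    return bool(result)

task: str = "compare CodeAgents benchmarks"
plan: List[str] = planner(task)   # Planner outlines the high-level plan
results: List[str] = []

for step in plan:                 # Solver works through the plan step by step
    result = solver(step)
    if reviewer(result):          # Reviewer gates each intermediate result
        results.append(result)
    else:
        plan.append(step)         # rejected steps are re-queued for another pass

print(results)
```

The point of the structure is that each agent reads from and writes to named, typed variables rather than free-form dialogue, so the whole exchange stays inside one coherent, inspectable program.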

CodeAgents introduces several key innovations to enhance expressivity and efficiency. It uses typed variables for clear data distinctions, control flow structures for dynamic reasoning, and reusable subroutines for modularity. Crucially, the entire prompting approach is optimized for token-cost awareness, ensuring efficient use of LLM resources without sacrificing reasoning quality.
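The token-cost claim is easy to sanity-check informally. The snippet below compares an invented natural-language instruction against an equivalent codified one using the open-source tiktoken tokenizer; both prompts are made up for illustration and are not taken from the paper:

```python
import tiktoken  # pip install tiktoken

natural_language = (
    "First, please walk over to where the apple is located. Then carefully "
    "pick up the apple. After that, walk to the fridge, open the fridge "
    "door, place the apple inside, and finally close the fridge door."
)

codified = (
    "walk_to('apple'); grab('apple'); walk_to('fridge'); "
    "open('fridge'); put_in('apple', 'fridge'); close('fridge')"
)

enc = tiktoken.get_encoding("cl100k_base")
nl_tokens = len(enc.encode(natural_language))
code_tokens = len(enc.encode(codified))
print(f"natural language: {nl_tokens} tokens, codified: {code_tokens} tokens")
print(f"reduction: {1 - code_tokens / nl_tokens:.0%}")
```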

Single-Agent and Multi-Agent Architectures

The framework supports both single-agent and multi-agent configurations. In the single-agent setup, planning, execution, and feedback are integrated into one loop. For example, in a simulated environment like VirtualHome, an agent generates pseudocode plans, executes them step-by-step, and uses runtime feedback for iterative replanning. Assertion checks are embedded to catch and recover from local errors, escalating to global plan revisions when necessary.
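A rough sketch of this execute/assert/replan loop is shown below, with hypothetical helpers standing in for the LLM-driven repair and replanning steps; the real framework's interfaces may differ:

```python
from typing import Callable, List

def make_step(name: str, ok: bool = True) -> Callable[[], bool]:
    # Hypothetical helper: builds a plan step that reports success or failure.
    return lambda: ok

def local_fix(step: Callable[[], bool]) -> Callable[[], bool]:
    # Stand-in for LLM-driven repair of a single failed step.
    return make_step("fixed", ok=True)

def global_replan(task: str) -> List[Callable[[], bool]]:
    # Stand-in for regenerating the entire pseudocode plan from scratch.
    return [make_step("alt-1"), make_step("alt-2")]

plan: List[Callable[[], bool]] = [make_step("walk"), make_step("grab", ok=False)]

for step in plan:
    try:
        assert step(), "postcondition failed"  # embedded assertion check
    except AssertionError:
        repaired = local_fix(step)             # first, attempt local recovery
        if not repaired():
            # Local recovery failed: escalate to a global plan revision.
            plan = global_replan("put apple in fridge")
            break

print("execution finished")
```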

The multi-agent framework, on the other hand, distributes these roles across specialized agents like Planner, ToolCaller, and Replanner. These agents collaborate through structured code-based exchanges. Each agent is initialized with a codified system prompt in YAML format, specifying its role and available tools. The Planner generates high-level plans as Python-style pseudocode, which can then be transformed into executable code for tool invocations or direct execution. If a tool execution fails, the Replanner agent is activated, consuming structured error traces to synthesize a revised sub-plan, enhancing system robustness.
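The paper specifies that role prompts are codified in YAML, though the exact schema below is an assumption made for illustration:

```python
import yaml  # pip install pyyaml

# Hypothetical codified system prompt for the Planner role; the field names
# are invented here, not taken from the paper.
PLANNER_PROMPT = """
role: Planner
goal: produce a Python-style pseudocode plan for the user task
tools:
  - name: web_search
    args: [query]
  - name: calculator
    args: [expression]
output: pseudocode
"""

config = yaml.safe_load(PLANNER_PROMPT)
print(config["role"], "can use:", [t["name"] for t in config["tools"]])
```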

Empirical Performance and Efficiency

CodeAgents was rigorously evaluated across three diverse benchmarks: GAIA, HotpotQA, and VirtualHome. The results consistently showed significant improvements in planning performance compared to natural language prompting baselines. For instance, on VirtualHome, CodeAgents achieved a new state-of-the-art success rate of 56%. In addition to accuracy gains, the approach drastically reduced input and output token usage by 55–87% and 41–70%, respectively, highlighting its superior token efficiency.

On multi-agent benchmarks like GAIA and HotpotQA, CodeAgents consistently matched or outperformed natural language methods in accuracy and F1 scores, while substantially cutting token usage and cost. For example, on GAIA, the codified approach improved accuracy by 10.7% for Gemini-2.5-Flash while reducing input tokens by 67.8% and cost by 67.4%. These improvements are attributed to the high semantic density and reduced ambiguity of the codified format, which requires fewer tokens per reasoning cycle.

The Future of LLM Agents

The research paper concludes that this codified prompting framework significantly enhances LLM reasoning by representing agent interactions as typed pseudocode with modular control flows. This structure not only improves transparency and execution reliability but also boosts token efficiency. The findings suggest a promising path towards more interpretable and verifiable AI systems. For more detailed information, you can refer to the full research paper available at arXiv:2507.03254.

Nikhil Patel
https://blogs.edgentiq.com
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
