
ParaCook: A New Benchmark for Time-Efficient Multi-Agent Planning with LLMs

TLDR: ParaCook is a new benchmark for evaluating how well Large Language Models (LLMs) can plan tasks for multiple agents to complete efficiently, focusing on minimizing time rather than just correctness. Inspired by the ‘Overcooked’ game, it uses cooking scenarios to test LLMs’ ability to handle parallel and asynchronous operations. Experiments show that even the best-performing models, such as GPT-5, still lag significantly behind humans in success rate, time efficiency, and coordination, especially on complex tasks. The research also reveals that LLMs have strong high-level planning potential in abstract settings but struggle to translate those plans into efficient actions in an embodied environment, highlighting the need for better structured planning approaches.

In the rapidly evolving world of artificial intelligence, Large Language Models (LLMs) have demonstrated impressive capabilities in planning complex, real-world tasks. However, a significant challenge remains: ensuring these plans are not just correct, but also time-efficient, especially in scenarios involving multiple agents working in parallel. This is the core problem that a new research paper, ParaCook: On Time-Efficient Planning for Multi-Agent Systems, aims to address.

Authored by a collaborative team including Shiqi Zhang, Xinbei Ma, Yunqing Xu, Zouying Cao, Pengrui Lu, and Zhuosheng Zhang from Shanghai Jiao Tong University, and Haobo Yuan, Tiancheng Shen, and Ming-Hsuan Yang from the University of California, Merced, the paper introduces ParaCook, a novel benchmark designed to evaluate time-efficient collaborative planning in multi-agent systems.

The Need for Time-Efficient Planning

Current benchmarks for LLM agents primarily focus on task completion and adherence to rules. While important, this overlooks a crucial aspect of real-world applicability: time efficiency. Imagine a team of robots in a factory or a group of virtual assistants; their success isn’t just about finishing tasks, but doing so quickly and efficiently. Complex tasks often involve both parallel and sequential steps. A single agent might multitask (e.g., chopping vegetables while water boils), and multiple agents can distribute workloads (e.g., cooking different dishes simultaneously). These forms of parallelism – within an agent and between agents – are vital for overall time efficiency.

Existing benchmarks like AsyncHow, Robotouille, and CookBench have limitations. Some oversimplify tasks, assume unlimited resources, or focus only on single-agent asynchronous planning. Others are too broad to isolate and evaluate time efficiency effectively. This gap motivated the creation of ParaCook.

Introducing ParaCook: A Kitchen for LLM Agents

ParaCook is inspired by the popular ‘Overcooked’ game, setting up a 2D grid-world kitchen environment where multiple agents (cooks) must collaborate to prepare dishes. The environment provides a simplified action space (MoveTo, Interact, Process, Wait, Finish) to ensure the primary challenge is strategic parallel planning, rather than complex low-level actions.
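
To make that action space concrete, here is a minimal sketch of how a plan over these actions might be represented in Python. The class names, fields, and targets are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ActionType(Enum):
    MOVE_TO = auto()   # walk to a target workstation or tile
    INTERACT = auto()  # pick up / put down an item
    PROCESS = auto()   # chop, cook, plate, etc.
    WAIT = auto()      # stay idle, e.g. while something cooks
    FINISH = auto()    # declare the order complete

@dataclass
class Action:
    agent_id: int
    kind: ActionType
    target: str | None = None  # e.g. "cutting_board" or "stove"

# A multi-agent plan can then be a list of per-timestep action sets, one entry per agent.
plan = [
    [Action(0, ActionType.MOVE_TO, "cutting_board"), Action(1, ActionType.MOVE_TO, "stove")],
    [Action(0, ActionType.PROCESS, "lettuce"),       Action(1, ActionType.WAIT)],
]
```

Keeping the action vocabulary this small shifts the difficulty to deciding who does what, and when, rather than how to perform each low-level step.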

The benchmark instantiates planning concepts through:

  • Task Decomposition: Recipes are broken down into subtasks with dependencies, forming a Directed Acyclic Graph (DAG).
  • Time Delay: Accounts for inherent execution delays, like cooking times, allowing for parallel execution.
  • Parallel Execution: Supports both intra-agent (e.g., one agent managing multiple tasks) and inter-agent (multiple agents working together) parallelism.
  • Actual Execution Time: Incorporates real-world factors like travel time between workstations.

Tasks are structured around various recipes (simple, intermediate, complex) and orders (combinations of dishes), allowing for controllable concurrency challenges. The kitchen map itself is configurable, enabling dynamic control over size, workstation arrangement, and agent count, which helps in fine-tuning the difficulty and evaluating different levels of parallelism.
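
As a rough illustration of task decomposition and time delay, the sketch below models a hypothetical soup recipe as a small dependency DAG and computes the earliest possible finish time when independent branches run in parallel. The subtasks, durations, and helper function are invented for illustration and do not come from the benchmark.

```python
# Hypothetical recipe decomposition: each subtask has a duration (its time delay),
# and dependencies say which subtasks must finish before another can start (a DAG).
durations = {"chop_vegetables": 3, "boil_water": 5, "cook_soup": 4, "plate": 1}
depends_on = {
    "chop_vegetables": [],
    "boil_water": [],
    "cook_soup": ["chop_vegetables", "boil_water"],
    "plate": ["cook_soup"],
}

def earliest_finish(task: str) -> int:
    """Earliest finish time assuming enough agents that independent branches overlap."""
    start = max((earliest_finish(dep) for dep in depends_on[task]), default=0)
    return start + durations[task]

# Chopping and boiling can overlap, so the lower bound is 5 + 4 + 1 = 10,
# not the sequential 3 + 5 + 4 + 1 = 13.
print(earliest_finish("plate"))  # 10
```

A planner that ignores this structure and executes the subtasks strictly in sequence would need 13 time units instead of 10 in this toy example, which is exactly the kind of inefficiency ParaCook is designed to surface.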

Measuring Success: Metrics in ParaCook

ParaCook evaluates agent performance using several key metrics; a rough sketch of how they might be computed follows the list:

  • Success Rate (SR): Measures the percentage of tasks where all dishes are completed correctly.
  • Order Completion Time (OCT): The total time elapsed until all orders are finished. This includes penalized (pOCT) and normalized (nOCT) variants to account for failures and compare efficiency among successful runs.
  • Movement Distance (MD): The average travel distance of all agents, indicating spatial efficiency.
  • Agent Utilization (AU): The proportion of time agents spend actively working, reflecting coordination quality.
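
The following sketch shows one plausible way per-episode logs could be aggregated into these metrics. The field names and the exact penalization and normalization schemes are assumptions made for illustration; the paper's formal definitions may differ.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool
    completion_time: float  # time until all orders are finished
    travel_distance: float  # tiles moved, summed over all agents
    busy_time: float        # time spent actively working, summed over all agents
    num_agents: int

def evaluate(episodes: list[Episode], time_limit: float) -> dict[str, float]:
    n = len(episodes)
    sr = sum(e.success for e in episodes) / n
    # Penalized OCT: failed episodes are charged the full time limit (assumed convention).
    poct = sum(e.completion_time if e.success else time_limit for e in episodes) / n
    md = sum(e.travel_distance / e.num_agents for e in episodes) / n
    au = sum(e.busy_time / (e.completion_time * e.num_agents) for e in episodes) / n
    return {"SR": sr, "pOCT": poct, "MD": md, "AU": au}
```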

Key Findings: LLMs vs. Humans

The research conducted comprehensive experiments with state-of-the-art LLMs, including GPT-5, Gemini-2.5-Pro, DeepSeek-V3.2-Exp, Claude-Opus-4.1, and Qwen3-Max-Preview. The findings reveal that while LLMs show promise, there’s a significant gap compared to human performance:

  • LLMs Struggle with Complexity: GPT-5 achieved the highest average success rate (65.0%) but saw significant drops on complex tasks. Other models performed even worse, with some failing almost completely on medium and hard tasks.
  • Time Efficiency is a Challenge: GPT-5 demonstrated the best time efficiency among LLMs, but all models were substantially slower than human baselines. Inefficient scheduling and failure to parallelize actions directly led to longer completion times and higher movement costs.
  • Humans Excel: Human participants achieved a perfect 100% success rate across all difficulty levels, demonstrating superior robustness, time efficiency, and spatial optimization. Humans completed tasks faster and with far less movement than even the best LLMs.
  • Chain-of-Thought (CoT) Prompting: The effectiveness of CoT prompting varied. It amplified performance for strong models like GPT-5 but could destabilize moderately capable ones and offered only limited help to weaker models.
  • Abstract Potential: Interestingly, when given abstract planning tasks (without environmental interaction), top LLMs achieved near-optimal performance (within 1-7% of the theoretical optimum). This suggests LLMs have strong inherent reasoning capabilities for high-level scheduling, but struggle when translating these plans into concrete actions within an embodied environment.


The Path Forward

ParaCook highlights that while LLMs can generate correct plans for less complex tasks, their execution strategies are not yet as temporally and spatially optimized as humans', especially under high coordination demands. The contrast between strong performance on abstract tasks and inefficiency in the embodied ParaCook environment underscores the need for structured approaches, such as hierarchical planning frameworks that separate high-level scheduling from detailed action execution.
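
One way to picture such a separation is sketched below: a high-level scheduler assigns abstract subtasks to agents, and a separate low-level step expands each subtask into concrete ParaCook-style actions. The scheduling rule, subtask names, and expansion table are toy assumptions, not a method proposed in the paper.

```python
# Toy high-level scheduler: round-robin assignment of abstract subtasks to agents.
def schedule(subtasks: list[str], num_agents: int) -> dict[int, list[str]]:
    assignment: dict[int, list[str]] = {a: [] for a in range(num_agents)}
    for i, task in enumerate(subtasks):
        assignment[i % num_agents].append(task)
    return assignment

# Toy low-level expansion: turn one abstract subtask into concrete actions.
def expand(subtask: str) -> list[str]:
    station = {"chop": "cutting_board", "cook": "stove", "plate": "counter"}[subtask]
    return [f"MoveTo({station})", "Interact", f"Process({subtask})"]

assignment = schedule(["chop", "cook", "plate"], num_agents=2)
plans = {agent: [act for task in tasks for act in expand(task)]
         for agent, tasks in assignment.items()}
print(plans)  # agent 0 handles chop and plate; agent 1 handles cook
```

Keeping scheduling abstract lets the model exploit the reasoning strength it already shows on abstract tasks, while the expansion step handles travel and interaction details.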

ParaCook provides a scalable and adjustable framework for developing and assessing time efficiency-aware multi-agent planning, laying a crucial foundation for advancing LLM agents that can truly exploit concurrency and collaboration in complex, dynamic environments.

